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ABSTRACT 



AN INVESTIGATION OF THE FEASIBILITY OF USING THE 
THREE-PARAMETER MODEL FOR 
FLORIDA'S STATEWIDE ASSESSMENT TESTS 

The three paramett am response theorv model, a 
subject of research for th ^ast three decades, has recently 
been succe.'ssf ully implemented in several large-scale testing 
programs. Other programs, including at least two statewide 
assessment programs, are currently considering its adoption. 
The purpose of this study was to investigate the feasibility 
of its use in Florida's minimum competency testing program. 
If it were to prove feasible for use. several aspects of 
Florida's SSAT progran could be improved. The inclusion of 
parameters for guessing and for item discrimination make 
possible the development of hisrb i ^ ef f icient tests and more 
precise measurement. 

An i nves t i gation of feasibility should proceed in 
several different phases. We have outlined a possible 
structuring of a feasibility investigation in a previous 
paper ^H::ils and Beard. 1984). This paper presents the 
results of the phase one studies proposed in that paper. The 
studies address the following questions. 

1. Will existing IPT computer programs work 
satisfactorily using the SSAT-II data? 

2. Is the assumption of unidimensionality valid for 
the SSAT-II data? 

3. Do the two- and three-parameter models fit the SSAT- 
II data better than the Rasch or one-parameter model? 

4. A^e the guessing (c) parameters estimable for the 
SSAT-I7 data (using the LOGIST 4 computer program)? 

5. How many examinees are needed to estimate the 
parameters? 

6. Can the parameter estimates be improved by 
oversampling the lower end of the ability distribution? 
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CHAPTER 1 



INTRODUCTION 

The three parameter item response theory model, a 
subject of iresearch for the last three decades, has recently 
been successfully implemented in several large-scale testing 
programs. Other programs, including at least two statewide 
assessment programs, a^i currently considering its adoption. 
The purpose of this study was to investigate the feasibility 
of its use in Flc^ida's minimum competency testing program. 
If it were to prove feasible for use, several aspects of 
Florida's SSAT progra.-n could be improved. The inclusion of 
parameters for guessing and for item discrimination make 
possible the development of highly efficient tests and more 
prec ise measurement. 

An investigation of feasibi?.ity should proceed in 
several different phases. We have outlined a possible 
structuring of a feasibility investigation in a previous 
paper (Hills and Beard, 1984)* This paper presents the 
results of the phase one studies proposed in that paper. The 
stud ies address the following questions. 

1. Will existing IRT computer programs work 
satisfactorily using the SSAT-II data? 

2. Is the assumption of un id i mens ional i tv valid for 
the SSAT-II data? 

3. Do the two- and three-parameter models fit the SSAT- 
II data better than the Rasch or one-parameter model? 

4. Are the guessing (c) parameters estimable for the 
CSAT-II data (using the LOGIST 4 computer program)? 

5. How many examinees are needed to estimate the 
parameters? 

6. Can the parameter estimates be improved by 
oversampling the lower end of the ability distribution? 

In order to answer the first question we analyzed 
actual SSAT-II data using LOGIST 4 (Wood, Wingersky, & Lord, 
1976) to see what would happen. There was a possibility that 
the easiness of the SSAT-II tests would cause problems in 
the estimation of the a and c parameters. These problems 
and the results of the trial analyses are given in chapter 
two. 

The second question is addressed in chapter three. 
All currently used operational measurement models assume 
unidimensionali ty; however, there is always the possibility 
that more than one dimension is beinc^ measured in an 
achievement test. The SSAT-II tests have been developed by 
specifying a number of more or less separate skills, each 
with several items. Such an approach could result in 
serious multidimensionali ty. On the other hand/ if there 
are only a few items for each skill, if achievement of one 



skill is correlated with achievement of others, and if many 
skills are included in a test (as is the case here), then a 
single "superdimension" may account for most of the variance 
in the scores, and the test may function as a unidimensional 
measure of that superdimension* Previous analyses by King 
and Hills have already indicated that the scores of the SSAT 
II meet a widely-used criterion for unidimensional ity. A 
more elaborate investigation of that assumption is described 
in chapter three. 

The next inquiry explored the goodness of fit of the 
one-, two-, and three-parameter models to the SSAT-II data. 
The person and item parameters were estimated using LOGIST 4 
and then used to estimate the responses to items of 
individuals, and the proportion correct for groups of 
individuals in several ability intervals. The differences 
between the estimated and actual results were summarized for 
each model in several ways. This study comparer in a 
relatively direct way the fits of the three models to the 
SSAT-II data. 

The fourth question is related to the first. If 
difficulties were to arise in applying the three-parameter 
model to the SSAT-II data it was believed that it would 
relate to the estimation of c parameters. Wingersky 
(personal communication. Summer, 1984) has suggested that 
since the items are so easy, there may be little guessing, 
and there may be no real need for precise estimates of c 
values. It might be sound to consider a two-parameter 
model, involving only difficulty and discrimination, or a 
modified three- parameter moael using something like the 
reciprocal of the number of decoys in an item as the c value 
instead of estimating it from the data. The results of a 
parameter estimation investigation are described in chapter 
five along with a consideration of the appropriateness of 
different models. 

Questions five and six are dealt with in chapter six. 
Studies of the optimum number of examinees to be used in 
analyzing the data available from Florida's tests were 
completed and are described in this chapter. The question 
has a great deal of importance because the more persons in 
an analy^jis, the more expensive it gets. For every 
additional person another parameter (ability) must be 
estimated. Current estimates in the literature range from a 
low of 700 persons to satisfactory results only with 3 or 4 
times that many. Recommendations of sample size for SSAT-II 
mathematics and communications are made on the basis of the 
results of empirical studies described in this chapter. 

An attempt was made to Improve parameter estimates by 
oversampling the lower end of the achievement distribution. 
Heavier sampling was done among those of extremely low 
ability levels because that is where the estimation 
difficulties arise and where most data are needed. This was 



investigated simultaneously with the sajnple size and 
parameter estimation problems. 

The above questions are fundamental and it appears that 
the problems associated with them can be resolved 
satisfactorily. They constitute the first phase of an 
empirical investigation into feasibility. Several further 
studies would seem appropriate for a second phase of the 
feasibility investigation. 



CHAPTER 2 



CONVERGENCE OF LOGIST 4 ON MINIMUM COMPETENCE TEST DATA 

In evaluating the feasibility of using the three 
parameter model for the Statewide Assessment Tests, a basic 
question is whether the LOGIST 4 program would successfully 
analyze the Florida data. Eignor reported (D. W. Eignor, 
personal communication. Summer 1984) that in 1979 ETS tried 
to use LOGIST 4 to estimate the item parameters for the New 
Jersey minimum competency tests. He indicated that they 
encountered severe convergence problems, i.e., after many 
iterations the program would not reach the end. The 
problem seemed to involve estimating the c parameters when 
the items were so easy (as minimum competency tests tend to 
be.) Eventually they tried using a two-parameter model, 
ignoring guessing or fixing it at a specific value, but they 
still had problems with convergence. A basic problem then, 
was to whether LOGIST 4 would converge using Florida 
Statewide Assessment Data (SSAT II). 

The data for the March, 1984, administration of the 
SSAT II were obtained. The file contained data for 127,033 
cases, but from these we removed any who were taking the 
test for the second time, any who were not in the tenth 
grade, the normal time for taking the SSAT II, any who were 
classified as deaf, hard of hearing, physically impaired, 
emotionally handicapped, educably mentally handicapped, had 
specific learning disabilities, or were educably mentally 
retarded. The remaining group contained 94,261 cases, too 
many for our purposes. A systematic sample of 9000 was 
taken from this group. 

The data from the tape reported each student's response 
to each item. From the data we determined whether the 
student had answered the item correctly, had answered 
incorrectly, had omitted the item, or had not reached the 
item. Omitted items are those which are not answered 
correctly or incorrectly before tne last item which is 
answered. Not reached items are items for which no answer 
has been made after the last item for which an answer is 
made. The responses were then recoded in the format 
required by LOGIST 4, with a correct answer recorded as 01, 
an incorrect answer as 00, an omitted response as 10, and a 
not reached item as 11. 

The LOGIST 4 program available to us for use on the CDC 
Cyber computer is limited to 3090 cases. So we sampled once 
again, creating 5 different random samples of approximately 
3000 cases, with replacement between sampling. That is, an 
individual might appear in more than one sample, but he 
could not appear in the same sample more than once. 

The communications and mathematics item responses were 
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analyzed for each of the five samples of 3000 cases by 
LOGIST 4. In each of the ten cases (2 tests times 5 
samples), convergence was reached, usually in about 20 
stages. Thus, the problem experienced by Eignor does not 
appear to occur with these data. 

However, it is noticeable in these data that even with 
75 items and 3000 cases, many of the c parameters were not 
estimated, i. e., the program when unable to estimate a c 
parameter gives the item the average c parameter of the 
it ems for which c parameters have been estimated. The 
program does not estimate the c parameter routinely when the 
quantity b - 2/a is less than -2. When this quantity is 
equal to or less than -2, the c parameters are poorly 
determined (Wood et al., 1976). Many fewer c parameters were 
estimated for the communications test than for the 
mathematics test, since the average difficulty levels of the 
communications items were much easier than the mathematics 
items. The means and standard deviations of the a 
parameters and b parameters for each data set, with 
standardization on theta, appear in Table 2.1. The numbers 
of c parameters estimated by LOGIST for the mathematics and 
communications tests for each data set appear in Table 2.2. 

It is clear from these results that LOGIST 4 can be 
used to estimate a and b parameters for the SSAT II data in 
the Florida Statewide Assessment Program. However, it is 
also clear that the c parameters, which reflect the 
probability of a person with very low ability answering an 
item correctly (probably at least partly due to guessing), 
will require special attention. Routine use of LOGIST 4 
does not result in estimates of many of the c parameters. 
The tests are so easy that few of the examinees are far 
enough down the ability scale to allow us to ascertain what 
the performance of a very low ability person would be. In a 
later chapter we will concentrate on developing procedures 
to estimate c parameters. If that is not successful, we 
will consider the use of a modified three-parameuer model in 
which all c parameters have the same nonzero value, or a 
two-parameter model in which all c parameters are fixed at 
zero as though there were no guessing and a person of low 
ability would surely not be able to answer an icem 
correctly. 



Table 2.1 



Means and Standard Deviations of a and b Parameters for 
Mathematics and Communications Tests for Each of the 

Five Samples 



Mean 
S.D. 



.93 
. 33 



a Parameter 
Mathemat ics 
91 .93 .93 
>32 .3^ .33 



.91 
. 32 



Communications 
Mean .83 .80 .83 .63 .80 
S.D. . 31 . 28 . 28 . 31 . 28 



b Parameter 
Mathematics 
Mean -1.32 -1.39 -1.32 -1.33 -1.39 
S.D. 1.19 1.31 1.14 1.22 1.30 
Communications 
Mean -2.09 -2.11 -2.06 -2.09 -2.17 
S.D. .84 .94 .82 .83 .80 
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Table 2.2 



Numbers of Estimated c Parameters 
for Mathematics and Communications Tests 
for Each of the Five Samples 



Sample 

1 
2 
3 
4 
5 



Mathematics 

2 
2 
3 
2 
2 



Communications 

C 
0 
0 
0 
0 
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CHAPTER 3 



DIMENSIONALITY OF SSAT II MATHEMATICS AND COMMUNICATIONS 

Background 

Dimensionality is an important aspect of item response 
theory. The un id i mens i ona 1 theory, which we are using in 
this project, assumes that only one dimension or 
characteristic of the examinee ie determining whether he 
answers an item correctly, and that the same dimension is 
involved for all examinees. If that is correct, then the 
desirable attributes of item response theory can be expected 
to hold. If the test involves more than one dimension, the 
estimate ot the level of an individual on the characteristic 
being measured may not depend on which items he is asked to 
answer, and the parameters of the items u.ay not depend on 
which individuals are administered the test, (It might be 
useful to point out that classical test theory must make the 
same assumption of unidimensional i ty for scores to be 
meaningfully interpretable. This was largely overlo >ked for 
many years by all but tl^ose who were familiar with the 
concepts of factor an;»lysis.) 

Un i d i m en s j o na 1 i t y is assumed by the Rasch (one- 
parameter) model as well as by two- or three-parameter 
models, so evaluation of dimensionality is not critical to 
deciding whether to move £:on\ a one-parameter modc?l to a 
two- or three-parameter model for Statewide Sr.uient 
Assessment purposes. What is found about dimensionality is 
relevant for interpretation of total scores on the SSAT 
tests, regardless of how they are analyzed. There is a 
difference, however, between what happens in a one-parameter 
model and in a more complex model in the presence of more 
than one dimension. The one-parameter model weights the 
dimensions present approximately according to the number of 
items on each. A two- or three-parameter model gives 
dominant weight to the first or strongest dimenfiion 
(Reckase, 1981). In any case, it would seem pointless to 
move to a more complex model from a simple model if a major 
assumption of both were seriously violated. So checking of 
this assumption seems particularly appropriate at this time. 

It is highly probable that given enough items and 
examinees, any set of data would be in violation of an 
assumption of un i d i mens ional i ty (Hambleton & Swaminathan, 
1985). As Lord (1968) indicated when d iscuss ing the College 
Board's Scholastic Aptitude Test , unidimensionality for the 
SAT cannot be strictly correct, but it is probably a 
tolerably good approxi mat ion. This was evidenced by a 
factor analysis by Coffman which found 11 factors, but most 
of the variance (16%) was on the first unrotated factor, 
with only about 2% of the variance on any other factor. 
Similarly with any test, the issue is not whether 
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unidiraenslonali ty is violated^ but whether it is violated 
to such an extent that the results are no longer useful. 



Procedure would be simple if there were an easy and 
wel 1 -accepted procedure to te^t the assumption of 
unidimensionality. Unfortunately, that is not the case. In 
1968, Lord, on the basis of a factor analysis by Coffman, 
inferred that the Verbal test on the SAT was sufficiently 
unidimensional to be worth study by item-response theory. 
Similarly, Lord and Novick (1968) said that a unidimensional 
space is most likely to be an ".adequate approximation" for 
tests "that appear as though they ought to be homogeneous; 
for example, certain tests of vocabulary, reauing, spelling, 
and some kinds of spatial ability. On the other hand, we 
should expect a matheiaatics test made up half of arithmetic 
reasoning items and half of plane geometry items to show at 
least k-=2 dimensions" (p. 381). They go on to suggest that 
if the first latent root of a matrix of tetrachoric 
correlations with communalities in the diagonal is large, 
and the second latent root is nearly as small as the rest, 
"there is good reason to treat their data as arising from a 
one-dimensional latent space" (p. 382). 

In 1980 Lord pointed out the "great need for a 
statistical significance test for the unidimensionality of a 
set of test icems" (p» 21) and gave a rough procedure based 
on the latent roots of the tetrachoric item incercor relation 
matrix with estimated communalities in the diagonals. Again, 
the procedure required comparing the size of the first 
latent root with the second, and the second latent root with 
the others. 

Other authors have suggested using factor analysis 
also. Hulin, Drasgow, and Parsons (1983) suggest using the 
results of factor analyzing the matiix of interitem phi 
coefficients with principal axes ex*^raction and squared 
multiple correlations as estimates of communalities. 
However, they point out that this procedure will result in 
spurious factors due co differences in item dif f icult ief. 
They also indicate that the number of f^^tors obtained by 
factor analyzing interitem tetrachc^ric correlations provides 
a valid estimate of dimensionality if the sample is 
sufficiently large. With tetrachoric correlations there are 
also problems of missing values and non-Gramian matrices, 
and if there is guessing on the items, the assumptions of 
the tetrachoric correlations are violated. Carroll is cited 
by Hulin, et al. (1983) as providing a correction for the 
guessing problem^ but they found his correction to be 
inadequate to eliminate spurious factor.^- Hulin et al. 
(1983) finally recommend a different tactor analvsis 
procedure called "modified parallel analysis" based on work 
by Drasgow and Lissak as "being a way for researchers to 
determine whether in fact there is a latent variable that is 
strong enough to allow application of IRT in an item pool" 
(p. 261). 

9 

E^C 13 



Some authors have not been willing to afccept factor 
analysis of interitem correlations as an appropriate 
procedure for evaluating dimensionr 1 ity, McKinley and Mills 
(1984)^ for example^ after noting that "...there is no 
generally accepted procedure for testing the assumption of 
unid imensionali ty" (p. 1), go on to develop an approach 
based on analysis of the residuals derived from estimating 
the probability of a correct response based on an IRT model 
and the observed response to the item. They argue that if 
the data are one dimensional, the residuals will be random 
error. So, if the correlation matrix based on the residuals 
is factor analyzed, there should be no common factor. 
Analysis of residuals eli mi nates the problem of difficulty 
factors, since the residuals are on a continuous scale. 
Product-moment correlations :an be used, so there is no 
problem with non-Gramian matrices. Divgi (1980) is reported 
to have used such a procedure on the Rasch model, but 
McKinley and Mills applied it to two- and three-parameter 
models. They found that on their synthetic data, while 
principal components analysis of tetrachoric interitem 
correlations would not have discriminated between cne- 
dimensional sets of data and sets with several dimensions, 
the analysis of residuals was effective in making that 
discL imina ticn. For the residuals, a ratio of greater than 
1.2 to 1^0 between the first eigenvalue and the second was 
evidence of more than one dimension. 

Hattie (1984) identified 87 different indices of 
un id imens i onal i t y. He then evaluated each of them in a 
Monte Carlo study of tests 15 items in length, with 1, 2, or 
5 dimensions, various degrees of item discrimination, 
homogeneity, difficulty, and guessing. He simulated 500 
cases, with 24 replications. The results were very 
discouraging for factor analysis apnroaches — as well as most 
of the rest of the 87 indices. For example, he stat-s, 
"Indices based on component or factor analyses do not aid in 
determining ur. id imens io»^ ty" (p. 71), and "...indices 
based o tetrachorico car , recommended" (p. 72). Also, 

"Using the number of .values greater than one to 

estimate the number of factors appears to lack 
justif icat-.Dn" (p. 72). "...the ratio of first and second 
eigenvalues [wasjnot [an] effective [index]" (p. 72) And, 
"Overall, linear fact r analysis is not appropriate for 
det'-rmining un^ d imens ional i ty" (p. 73). He does find a 
study of residuals to be useful for the two-parameter model, 
but in his study does not go as f&r as to recommend a cutoff 
value for deciding that data are unidimensional. 

Reckase (1979) studied the problems of applying 
unifactor methods to multifactor tests and concluded that if 
the first principal component is large relative to the other 
factors "good ability estimates can be obtained from the 
(one parameter and three parameter) models even when the 
first factor accounts for less than 10 perci it of the test 
variance, although item calibration results will be 
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unstable. For acceptable calibration, the first factor 
should account for at least 20 percent of the test 
variance." (page 228). He suggests that sufficient 
unidimensionality for scable item parameter estimates should 
be present for 50-item tests when the first eigenvalue is 10 
or greater. 

Hambleton a-»d Swaminathan (1985) note that Bejar (1980) 
suggested that one evaluat^^ imidimensional ity by splitting a 
test into subtests based on content^ i.e., make a subtest of 
items that appear to test a different aspect of content than 
the rest of the items. For the items in that subtest^ 
obtain item parameter estimates twice^ once by themselves 
and once including them among the other items in the larger 
test. Compare the two sets of item parameter estimates. 

Much earlier^ in 1961, Lumsden (cited in Hambleton & 
Swaminathan ^ 1985) had sugg'^jsted a similar idea for 
constructing tests that would be un id i mens ional . He 
advocated factor analyzing a set of test items and removing 
items that did not measure the dominant factor. The 
remaining items would be factored again^ and again items not 
measuring the dominant factor would be removed. This would 
be repeated until a satisfactory solution was reached. 
Lumsden proposed that the ratio of first-factor variance to 
s e CO n d - f a c t o r variance be used as the index of 
unidimensional i ty . 

Martoisy Rickard^ and Stiles (1985) reported that in 
their analyses of data from the California Adult Student 
Assessment System^ the criteri^i proposed by Bejar and 
Reckase were in serious disagreement. Out of twelve testS/ 
Bejar's criterion found only one to be unidimensional^ but 
all twelve of them met Reckase's criterion. 

Procedure 

Obviously the problem of determining whether a set of 
data is sufficiently unidimensional for scores from the data 
to be used as measures of a single characteristic of 
examinees has not been jatisfactorily resolved. Many 
possible approaches have been advocated, and most of them 
have been criticized or found ina^lequate upon further study. 
Our approach was to pick one of the methods for evaluating 
dimensionality that has been widely used and generally 
accepted/ even though criticized, and use it to evaluate 
unidimensionality. Then to go further, we attempted to 
purify the data along the lines suggested by Bejar and by 
Lumsden and to determine whether "purification" of what 
already appeared to be unidimensional would result in any 
important change in item parameters. If no appreciable 
change In item parameters resulted from purification, then 
we wou3d conclude that the data were sufficiently 
unidimensional that item parameters resulting from its 
analysis would not be importantly affected by whatever 
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multidiroensionalil-y remained. 



The method for evaluating unidimensionality that we 
chose was the one recommended by Reckase (1979) of computing 
a principal components analysis on the phi correlations 
among items and examining the percent of varianf^e 
contributed by the first unrotated factor and the 
relationships among the eigenvalues ot the factors. If the 
first factor variance is 20% or greater^ and if the first 
eigenvalue is large compared to the test, with the remaining 
eigenvalues being similar in size to the second, the data 
would appear to be sufficiently uni dimensional for the item 
parameters from the three-parameter logistic model to be 
satisfactory. While Martois et al. (1985) found that 
Reckase*s and Bejar's methods did not agree, and Hambleton 
and Swaminathan poiater^ out the problems of using factor 
analysis in determining unidimens ionali ty, when Hambleton 
and Swaminathan reported an actual study in the same text in 
which they had been critical, they used Reckase's criterion 
(p. 275). Combining the Reckase and Bejar approaches should 
establish a conclusion on a firmer foundation than either 
taken alone. 

Data and Results 

The data from the SSAT II administration of March, 
1984^ were factor analyzed using the principal components 
method, unities in the diagonals. Random samples of 
approximately 750 cases were used in each analysis. 
Analyses were done separately for the communications and 
mathematics sections. The eigenvalues and percentages of 
variance accounted for by each of the first 6 factors appear 
in Table 3.1, below. 

It can be seen in Table 3.1 that in each case the 
percent of variance accounted for was near 20%, that the 
first eigenvalue was greater than 10 and very large compared 
to the second and subsequent eigenvalues, And that the first 
factor accounted for a considerably largijr proportion of the 
variance than the second and subsequent factors. Since 
items were intercor related using phi coefficients, the 
second factor, if it exists at all, is most likely a factor 
related to the difficulty values of the items. Judging from 
these results, the SSAT II mathematics and communications 
scores appear to be reasonably unidimensional. 

To explore dimensionality further, for the mathematics 
test the skill scores were factor analyzed using the 
principle components method. The seven skills which were 
least highly related to the first factor were deleted, 
creating a more homogeneous test. LOGIST 4 was used to 
estimate the item parameters and the abilities. 



ERLC 



Table 3*1 



eigenvalues ani Percentages of Variance Accounted For 



Mathematics 



Communica uions 



Eigenvalue 



Percent 



Eigenvalue 



Percent 



12.48 
2.40 
2.22 
1.73 
1.59 
1.53 



16.6 
3.2 
3.0 
2.3 
2.1 
2.0 



13.87 
2.46 
2.14 
1.87 
1.64 
1,57 



18. 5 
3.3 
2.8 
2.5 
2.2 
2.1 



standardizing on abilitie3^ for both the heterogeneous test 
using the o dd -n u m be r ed items (to equalize length 
approximately) and the homogeneous test using only the 
approximately half of the items (40 items on 8 skills) which 
loaded most highly on the first factor. (Twenty items on 
the homogeneous test were among the odd-numbered items on 
'.he heterogeneous test.) The differences between the 
parameter values for these two analyses were calculated, and 
the means and standard deviations of the twenty differences 
obtained. Those means and standard deviations appear in 
Table 3.2, below. (The statistics for the c parameters are, 
of coursef not very meaningful, since most c parameters are 
set at the average value of the estimated c parameters in 
these data.) Similarly, the means and standard deviations 
of the differences between abilities estimated from the two 
sets o£ it^ms were calculated for a systematic sample of 158 
cases. (Of the systematic sample of 158 cases, 13 persons 
who obtained perfect scores on either test were deleted 
leaving a reduced sample of 145 cases.) 

It can be seen in Table 3.2 that the mean differences 
in item para^if^ter values for the heterogeneous groups of 
items and the homogeneous groups of items, as defined above, 
are quite small, and their variability is not great. In 
other words, if content mul t i d i mens ional i ty is present in 
these data, removing the half of the items that ""east 
reflects the major dimension makes only slight differences 
in the itejn parameters. Thus, whatever multidimensional 5 ty 
is present is of little practical importance in estimating 
item parameters, "'he mean difference for thetas was .08, 
but the standard deviation was .39. What this implies about 
dimensionality is not clear at this point. In general, 
we interpret these results to indicate that the Statev/ide 
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Table 3.2 



Means and Standard Deviations of Differences 
In Item Parameters 



a 



b 



c 



Mean 
Standard Dev • 



.053 
.066 



.043 
.111 



.031 
.031 



Assessment Tests meet the criterion of un idi mensional i ty 
sufficiently well that results from parameter estimation 
procedures which assume unidimensionality, such as the Rasch 
model or the three-parameter model, will be quite useful. 

Attempts to compare item parameters and ability 
estimates from heterogeneous tests and homogeneous tests of 
communication skills were unsuccessful. LOGIST 4 would not 
converge using only 40 items and 3000 cases on a best 
composed of items as easy as the communications items in 
SSAT II. The mean item p value for the mathematics items 
was .81, with 16 of the 75 items having p values less than 
0.75, and the smallest p being .33. The mean item p value 
for the communications items was .90, with only 2 having p 
values below 0.75, and those 2 were .74. The smallest p was 
0.74. This suggests that the mathematics items are on the 
borderline of being unanalyzable by LOGIST 4, and the 
communications items are so easy that one must have a large 
number, such as 75 as in the total test, in order £or the 
program to function satisfactorily. 



Dimensionality is not an easy assumption to evaluate. 
New approaches to it appear frequently. It may be that no 
set of data with more than a few items is completely 
unidimensional. The fundamental ijsue for item response 
theory analyp^s is whether the data are so multidimensional 
that misleading interpretations are produced. Our 
impression from the above analyses is that the SS^T data are 
sufficiently unidimensional that the advantages of item 
response theory should be pursued. Nothing here suggests 
that grossly misleading results will be produced due to 
mul tidimensional ity . 



Conclusion 
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CHAPTER 4 



COMPARISON OF THE FITS OF THE ONE- TWO- AND 
THREE-PARAMETER MODELS TO FLORIDA'S 
SSAT-II DATA 

The appropriateness of the one-parameter ^ or Rasch 
model/ for multiple choice achievement test data has been 
questioned since its introduction. The Rasch model is based 
on the assumption that the response to a test item is based 
only on the ability of the person and the difficulty of the 
item. Proponents of the three-parameter model argue that 
one item parameter is insuf f icient ^ and that test data are 
modeled more closely by including additional ?tem parameters 
for guessinc^ and for item discrimination. Advocates of the 
Rasch model claim that problems in estimating these two 
additional parameters negate potential advantages in their 
use (Wright, 1977). 

Hambleton and Murray reviewed the literature on good- 
ness of fit and proposed three ways for addressing the fit 
of test data to an item response model. (1983, p. 72) 

a. Determine if the test data satisfy the assump- 
tions of the test model of interest. 

b. Determine if the expected advantages derived 
from the use of the item response model (for 
example, invariant item and ability estimates) 
are obtained . 

c. Determine the closeness of fit between pre- 
dictions and observable outcomes (for example, 
test score distributions) utilizing model 
parameter estimates and the test data. 

In a separate paper (1985) Murray reviewed the litera- 
ture on fit analysis a.,d presented a rationale for the use 
of exploratory analytic techniques without tests of statis- 
tical significance. The statistical significance tests 
which have been previously used have been shown to be sensi- 
tive to sample size, with larger numbers of misfitting items 
being found when larger sample sizes were used. The explo- 
ratory techniques involve an examination of size and direc- 
tion of residuals, descriptive statistics, and the identifi- 
cation of particular kinds of misfit. 

This study will focus on the third method above, the 
closeness of fit between predictions and observable outcomes 
utilizing model parameter estimate<> and the SSAT-II data. 

The analyses completed closely parallel those done by 
Hambleton, Murray, and Williams in an investigation of the 
fit of the one-, two-, and three-parameter models to item 
scores from the 1982 Maryland Functional Read ing Test 
(1983). They found larger differences between the one- and 
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two- parameter models than between the two- and three- 
parameter models. They concluded that the two-parameter 
model was adequately fit by the data. They speculated that 
the small improvement in fit gained by using the three- 
parameter model was related to the easiness of the tests 
which resulted in small amounts of guessing by the students 
who took the tests. 

The fit of *"he SSAT-II data to the Rasch model has also 
been studied in previous investigations of item fit (Beards 
Julian^ ano Roca^ 1984) and of person fit (Beards Julian^ 
Richards^ and Roca^ 1984). 

Method 

The analysis of residuals has been an integral part of 
tests of fit of data to the Rasch model since the first 
BICAL programs were prepared by Wright and Panchapakesan 
(1969). More recently^ the technique has been used to test 
the fit of the three-parameter model (Hambleton and Murray^ 
1983). The basic idea is to estimate parameters for a 
psychometric model and use the parameters to estimate item 
responses. The differences between the estimated aad actual 
responses constitute the residuals. These residuals are 
then analyzed in various ways to show the amount and nature 
of misfit of the data to the model. Models which more 
closely fit the real data would be preferred over those 
which fit it less well^ all other things being equal. 

Sample 

A random sample of 3,000 cases was selected from the 
population of 100,571 students who took the SSAT-II in the 
March 85 administration. The item responses were scored as 
right or wrong wrth omitted or "not reached" items also 
scored as incorrect. This sample provided the data for 
separate analyses of the mathematics and communications 
parts of the te^t. The numbers of cases in the analyses 
were reduced to 2929 for mathematics and to 2768 for commun- 
ications when zero and perfect scores were removed by the 
computer program. 

A nalyses 

The LOGIST 4 computer program (Wood et al., 1976) was 
used to generate the person abilities and item parameters 
for the one-, two-, and a modified three-parameter model 
from the 3,000 SSAT-II item response records. A modified 
three-parameter model was used because the LOGIST program 
could estimate c parameters for only a small number of the 
items. The modification consisted of fixing the "c" parame- 
ters at .20 for all items. The term "three-parameter 
model" will be used throughout this part of the paper to 
indicate the "modified three-parameter model". 
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These abilities and item parameters were then used to 
estimate responses and to compute average residuals for each 
item and across all 75 items by selected ability groups 
using a computer program entitled RESID developed by 
Hambleton and Murray (1983). The display of average re- 
siduals by ability levels shows the avei^^ge departure of the 
data from the item characteristic curves at different inter- 
vals on the ability continuum. The mathematics ability 
continuum was divided into twelve intervals and the communi- 
cations into ten groups in order to represent best the 
distribution of log ability estimates for the two sets of 
scores . 

For each score interval a residual was found by compu- 
ting the difference between the proportion of the examinees 
answering each i tern cor rec y (P) and the proportion expec- 
ted to ^nswer correctly b .ed on the item characteristic 
curve (P). Standardized re: .duals were computed by dividing 
each residual by the standard error of the proportion cor- 
rect . 



There was concern initially that the standardized resi- 
duals might be overly inflated by small standard errors of 
proportions resulting from easy test items. However^ an 
examination of the results showed that easier items did not 
tend to yield the highest standardized residuals. This 
finding is consistent with that of Murray (1985) who com- 
pared the use of raw and standardized residuals extensively 
and concluded that standardized residuals yielded results 
which were similar to raw residucils and provided the addi- 
tional benefit of taking into account sampling errors asso- 
ciated with the proportion correct (P-value) . 

The mean absolute values of raw and standardized re- 
siduals by ability intervals are shown in Table 4.1. The 
mean raw residuals are inversely related to the ability 
levels of the students; i. e.^ uhe residuals are greater for 
low ability and smaller for high ability students. This is 
:iue for the one-^ two-^ and three-parameter models. The 
raw residuals are greatest for the one-parameter model and 
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Table 4.1 

Average Absolute Values of Raw and Standardized Residuals 





Numbe;" 


As'. 


Absolute 




Av. Absolute Valued 






of 




Valued Raw 




Standardized 




Ability Examinees 


Residuals 




Residuals 




Level 


1-P 


2-P 


3-P 


1-P 


2-P 




1-P 


2--P 


J— r 


Mathenatics 



















-2.75 


9 


6 


16 


. 183 


.152 


• 116 


2.359 


1. 198 


1.048 


-2.25 


45 


27 


50 


.169 


.124 


• 062 


3.595 


1.638 


.973 


-1.75 


137 


131 


111 


.112 


.043 




3.547 


1.139 


.935 


-1.25 


323 


325 


289 


.067 


.024 


.022 


3.036 


.998 


.OO0 


-.75 


454 


482 


467 


.030 


.019 


.016 


1,619 


1.034 


.ool 


--25 


541 


556 


571 


.025 


.016 


. 014 


1.714 


1.073 


.986 


.25 


491 


494 


533 


.030 


.011 




2.596 


.995 


.878 


.75 


477 


432 


462 


.032 


.009 


.008 


3.249 


.910 


.822 


1.25 


291 


254 


231 


.025 


.011 


.011 


2.420 


.934 


.957 


1.75 


124 


133 


134 


.019 


.010 


.010 


1.434 


.763 


P1 P 
. OXO 


2.25 


74 


68 


66 


.015 


.018 


.015 


1.054 


.996 


QQQ 


2.75 


1 


21 


19 


.023 


.018 


. 0ZZ 


.233 


.657 




Overall 




















Average 






J, - 


.061 


.038 


ClOQ 


2.238 


1.028 




N 


2967 


2929 


2949 














Communications 


















-3.00 


18 


20 


17 


.203 


.120 


. Xt)J 


2.903 


1.361 


.960 


-2.50 


51 


32 


43 


.173 


.074 


.062 


3.401 


.918 


P71 


-2.00 


104 


97 


93 


.121 


.034 


.031 


3.249 


.724 




-1.50 


147 


166 


150 


.068 


.024 




2.406 


.707 




-1.00 


296 


283 


280 


.034 


.018 


oi^ 7 


2.049 


.836 


.776 


-.50 


464 


492 


484 


.029 


.010 


. V)Xt) 


2.657 


.768 


.772 


0.00 


401 


557 


556 


.031 


.006 


.006 


3.258 


.661 


.650 


.50 


602 


487 


505 


.026 


.007 


.007 


4.344 


.941 


.863 


1.00 


349 


341 


327 


.018 


.007 


.007 


2.760 


.967 


.943 


1.50 


336 


193 


186 


.011 


.015 


.016 


1.862 


1.418 


1.464 


Overall 




















Average 








.071 


.032 


.029 


2.889 


.930 


.866 


N 


2768 


2668 


2641 















least for the three-parameter model at almost all ability 
levels; however, the differences in mean residuals among the 
three models become trivial at the highest ability levels. 
These data also show the tendency of small ra* residuals to 
appear large when standardi zed by d ividing by standard er- 
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rors from large P-values. A mean raw residual (one-parame- 
ter) of .025 at an ability level of 1.25 yields a mean 
standardized residual of 2.420; whereas a raw residual (two- 
parameter) of .024 at an ability level of -1.25 yields a 
standardized residual of only .998. 

Several descriptive analyses were done in order to 
evaluate (a) the residuals for each of the one- two- and 
three-parameter models for each i tem ^ (b) the distributions 
of standard i zed residuals^ (c) the average residual for 
selected ability levels^ (d) the association between re- 
siduals and content categories^ (e) the association between 
residuals and difficulties of items^ and (f) the association 
between the items' residuals and discrimination indices. 

Resul ts 

The mean standardized residuals for the one-^ two-/ and 
three-parameter models are shown in Table 4.2 for Mathema- 
tics and Table 4.3 for Communications along with the classi- 
cal item analysis indices of difficulty and discrimination. 
The proportions correct reflect the easiness of this minimum 
competency test. The residuals are smallest for the three- 
parameter model and largest for the one-parameter. In fact^ 
none of the 150 standardized residuals for the two- or 
three-parameter models are as large as 2.00 which could be 
considered statistically significant. Sixty-nine of the 150 
residuals for the one-parameter model exceed this arbitrary 
significance value. A casual inspection of these data would 
certainly indicate that use of more item parameters improves 
fit to the item responses. 

The data in Tables 4.2 and 4.3 also confirm that in the 
one-parameter model difficult items show greater misfit than 
easier ones. For example^ note item 66 in the mathematics 
test. It is the most difficult item (p = .34) and also has 
the poorest fit to the one-parameter model (10.155). Its 
fit to the two- and three-parameter model is not signifi- 
cantly large (1.579 and 1.423 respectively). 

Tables 4.2 and 4.3 also show that the items within most 
skill areas vary considerably in difficulty^ discr iminat ion^ 
and fit to the IRT models. 

The distributions of fit indices for the three models 
are shown in Table 4.4. The total set of indices consists 
of one index for each of 75 items for each of the score 
intervals^ and can be arrived at by multiplying the number 
of items by the number of intervals^ 12 for mathematics and 
10 for communications. The residuals should be normally 
distributed if the data fit the model. Table 4.4 shows that 
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Table 4,2 
Mathematics Item Statistics 



Absolute-Valued 
Test Prop, Pt.Bi serial Standardized Residuals 

Item Skill Correct Correlation 1-p 2-p 3-p 



1 


1 


.986 


.117 


.838 


.837 


. 593 


2 


1 


.978 


.120 


1.257 


.447 


. 480 


3 


1 


.919 


. 227 


2.384 


1.067 


1.070 


4 


1 


.875 


.389 


1.455 


.860 


. 999 


5 


1 


.948 


. 370 


1. 127 


1- 099 


. 904 


6 


2 


.927 


.262 


1.628 


. 627 


. 736 


7 


2 


.594 


.427 


2.913 


.845 


. 576 


8 


2 


.878 


.383 


1.311 


.869 


.808 


9 


2 


.896 


. 312 


1.748 


. 778 


. 693 


10 


2 


.487 


.468 


2.140 


1.230 


.849 


11 


3 


.717 


.470 


1. 422 


. 971 


1. 189 


12 


3 


.971 


.231 


.779 


. 959 


.784 


13 


3 


,928 


.288 


1. 310 


1. 214 


.706 


14 


3 


.902 


.241 


2.945 


1.053 


.660 


15 


3 


.958 


.298 


1. CI18 


1. 020 


1.099 


16 


4 


.836 


.485 


.834 


.913 


.944 


17 


5 


.829 


.453 


1.126 


.883 


.906 


18 


5 


.897 


.402 


.921 


.889 


.745 


19 


5 


.851 


. 309 


3. 177 


1. 223 


1.013 


20 


5 


.925 


.440 


1.169 


.810 


.820 


21 


5 


.946 


. 332 


1.299 


1. 117 


.986 


22 


6 


.964 


.284 


.910 


1.126 


.673 


23 


6 


.896 


.366 


1.471 


1.082 


.813 


24 


6 


.950 


. 407 


1.216 


1.125 


.814 


25 


6 


.880 


. 515 


1.094 


.832 


.708 


26 


6 


.964 


.304 


1.202 


1.198 


1.174 


27 


4 


.847 


. 572 


1.474 


1. 080 


1. 147 


28 


4 


.867 


.429 


1.241 


1.343 


1.351 


29 


4 


.611 


. 243 


7.424 


1. 041 


1. 112 


30 


4 


.921 


.252 


2.036 


.846 


.648 


31 


7 


.680 


.489 


1.653 


.893 


.820 


32 


7 


.802 


.377 


2.266 


.947 


.805 


33 


7 


.854 


.429 


1. 019 


1. 124 


1. 105 


34 


7 


.881 


.412 


1.0C5 


1.129 


1.197 


35 


7 


.820 


. 340 


2.498 


.997 


1.142 


36 


8 


.980 


. 307 


1.430 


.982 


1.003 


37 


8 


.877 


. 402 


1. 086 


.802 


.628 


38 


8 


.831 


. 236 


4.839 


1.184 


.905 


39 


8 


.912 


. 261 


2. 125 


. 969 


.866 


40 


8 


.826 


.461 


1.057 


.910 


.832 



continued 
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(Table 2 continued) 



41 


9 


.749 


.241 


5.446 


.940 


.972 


42 


9 


. 748 


.261 


5.785 


. 655 


.856 


43 


9 


.735 


.503 


1.123 


1. 021 


.598 


44 


9 


.806 


.442 


1.438 


1. 432 


1.467 


45 


9 


.814 


.551 


1.451 


1.404 


1.109 


46 


10 


.803 


. 436 


1.597 


.895 


.896 


47 


10 


.834 


.235 


4.414 


.767 


.885 


48 


10 


. 828 


.320 


2.951 


.810 


.785 


49 


\0 


. 904 


.287 


2.050 


.755 


.713 


50 


10 


.864 


.260 


3. 270 


.976 


.895 


51 


11 


.888 


.377 


1.143 


.884 


.651 


52 


11 


.880 


.437 


.582 


. 777 


.956 


53 


11 


.881 


.269 


2.700 


.736 


.812 


54 


11 


.601 


. 411 


3. 216 


1. 201 


1. 074 


55 


11 


.6P1 


.534 


.941 


1.059 


1.008 


56 


12 


.774 


.432 


1. 561 


.788 


.832 


57 


12 


.756 


.343 


3.333 


.952 


.841 


58 


12 


, 372 


. 190 


8. 145 


1.468 


1.023 


59 


12 


.878 


.337 


1.740 


1.051 


.954 


60 


12 


.669 


.482 


1. 595 


1. 089 


1. 031 


61 


13 


.485 


.269 


7.351 


.960 


.918 


62 


13 


.787 


.497 


.989 


.911 


.776 


63 


13 


.623 


.477 


1.845 


.877 


.709 


64 


13 


.738 


.395 


2.732 


1. 041 


.747 


65 


13 


.776 


.387 


2.626 


.796 


.982 


66 


14 


. 340 


.132 


10. 155 


1. 579 


1.423 


67 


14 


.806 


.470 


1.948 


1.842 


1.114 


68 


14 


.707 


.431 


2. 430 


.488 


.514 


69 


14 


.641 


.446 


2.975 


1.321 


.876 


70 


14 


.781 


.454 


1.845 


1. 336 


1.095 


71 


15 


.722 


.540 


.844 


.882 


.340 


72 


15 


.692 


.439 


2. 484 


1.901 


.690 


73 


15 


.662 


.455 


2.320 


1.485 


.770 


74 


15 


.763 


.471 


1. 709 


1.497 


1. 124 


75 


15 


.822 


.471 


1.206 


1.201 


1.313 



21 

25 



Table 4.3 
Communication Item Statistics 



Absol ute -Valued 
"Test Prop, Pt.Biserial Standardized Residuals 

Item Skill Correct Correlation 1-p 2-p 3-p 



1 


1 


878 


• zoo 


4 ^Z 


1.005 


1.143 


2 


0 


Q 7 01 


0 "7 Q 

• z / y 


656 


• 68 1 


1.007 


3 


2 


Q7Q 


7 0 7 
• JZJ 


i • bbi 


• 878 


1. 010 


4 


0 


Q 7 P 
♦ 37/0 


• j//y 


T T C /I 

1^350 


• 688 


.991 






Q 7 01 


T T T 


1 r" r *l 

1^ 553 


• 710 


1. 008 


f; 


e: 


• 7 D 1 


• 44 0 


• 772 


• 860 


.787 


7 


1 


O / D 
• OHO 


• 321 


4^ 724 


• 505 


. 308 


Q 
O 


1 


QUI 

• o bl 


• 457 


2^007 


• 775 


.655 


Q 


Z 


Q "3 

• ^ Jz 


• 435 


1^ 168 


• 936 


.675 




Z 




• 402 


1^ 580 


• 562 


.509 




o 


Q P 1 
• 7 O 1 


• 312 


1^ 389 


1^ 124 


»746 


1 2 


o 


• y y i 


OCT 

• 267 


1^145 


1^ 174 


.956 




p 


Q 7 


• 22o 


4^ 729 


• 676 


.738 


14 


p 
o 


• y 1 fc9 


o o o 

• 2o2 


3^549 


• 907 


.875 


1 s 


p 
o 


Q ^ ^ 
• 7 4 D 


OCT 
r Zbl 


&• 553 


1^ 258 


. 707 






P "7 7 
• O / J 


• 4 11 


2^ 759 


• 568 


.649 


1 7 


J. 


• O D4 


C 1^ r 


1 • 279 


• 739 


.778 


J. o 


n 
1 


• o by 


TOT 

• 383 


3^ 122 


• 898 


.496 


1 Q 


n 

1 


O Q C 

• o y D 


• 530 


• 894 


1^ 202 


.947 


9CI 
Z Kl 


c 
D 


• o jy 


• 394 


3^ 057 


• 789 


.691 




i: 


Q 1 


Vl "7 0 
• 4 /Z 


1 n ff o 
i • 70O 


• 876 


. 986 


22 




Q 9 1; 

• 7 Z D 


R vi yi 
• D44 


• 9 6 J 


1^ 089 


1.104 


23 


D 


• O D O 


vi i; ^ 
• 4 00 


O nLC\ A 
Zm ]09*^ 


1^ 248 


.988 


24 


7 


P ^ P 
• O 4 O 


Vl "7 7 
• 4 / J 


1 o o c 

1 • Jz 5 


1^226 


.811 


25 


7 


P 1 1 
• O X ± 


• 4 J. Z 


0 "7 7 P 
Z • /JO 


1 T O /I 

!• /Z0 


• 964 


26 




P C9 


7 

• J J J 


4 • 1 0 10 


im 057 


1^149 


27 


10 


.834 


• 399 


3.841 


Q01 7 


7 17 


28 


9 


.823 


• 398 


3.479 


1. 378 


1.432 


29 


9 


.901 


• 337 


3.679 


.882 


.857 


30 


9 


.956 


• 330 


6.124 


1. 139 


.767 


31 


9 


.832 


• 224 


7. 519 


.899 


.967 


32 


9 


.973 


• 356 


1.818 


1.063 


.699 


33 


6 


.898 


• 507 


1. 104 


.767 


.744 


34 


6 


.917 


• 454 


.884 


1. 148 


.927 


35 


4 


.911 


• 372 


3. 008 


. 978 


.730 


36 


11 


.952 


• 507 


1.027 


.897 


.931 


37 


3 


.926 


• 482 


1. 086 


1. 134 


.919 


38 


11 


.936 


• 321 


2.652 


.458 


.451 


39 


6 


.881 


• 445 


1. 520 


1. 362 


1. 058 


40 


3 


.907 


• 524 


.667 


.840 


.614 
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(T aole 3 continued) 



A 1 




fi Q ^ 
• O 7 0 


7 P P 

• J oo 


0 ^ 7 

Z . 4 / 0 


• III 


. b30 


AO 




Q i; i; 

• 7 D D 


^ 01 "7 
« 4 0 « 


1 "3 
1 • DO J 


CI ^ 

. b /z 


T C e 

.756 


H. O 






Vl Q Q 

• 4 o 7 


1 . Z4 D 


. 9oo 


.972 


Ad 


1 1 

L L 


Til 


1 01 

• J r? D 


f . Z4 0 


1 . Zl 4 


1 1 yi yi 
1 . z4 4 


A^ 


1 9 


7Q 1 


• J Kf 4 


D . 100 1 


1 0 C,A 

1 . Z D 4 


1 yi "3 c 
1 . 4 J o 


Af\ 


1 9 


• 0 J / 


• J 4 Z 


A "311^ 


1 o 7 yi 
1 . Z / 4 


1 c o 
1 . J5o 


Al 




• 1 


• 4 o 5 


1 7 O 01 
1 . /o K 


1.14/ 


1 O Q C 

1.29b 


A C 


4 


• ^4 ^ 


• 4 J / 


1 /X Q >l 

1.094 


.748 


. 629 


A Ci 


\ 


one 

• o9o 


1 O A 

.399 


2.884 


.905 


. 919 


50 


1 /X 
10 


fl o o 

• 92o 


.403 


1. 850 


.637 


. 647 




/ 


• 924 


• 581 


1.442 


1.536 


1.056 


52 


4 


. 939 


• 488 


1. 084 


.855 


. 793 




1 1 


• 957 


.473 


.725 


.911 


.818 


54 


1 


r\ n f\ 

* 909 


r c c 

.555 


.867 


1 1 1 T 

1.117 


1. 027 


c c 
55 


1 n 
1 0 


ft o o 

• coo 


.265 


5. 369 


1.062 


. 785 


5d 


1 a 
10 


• ^ J 1 


.400 


1 OCT 
1 • Jb 3 


. 9o4 


1.019 


5 / 


10 


• r j9 


. O00 


7 All 7 
i« 01J 


a >l "7 


1.01b 


Do 




• 0^1 


. J41 


"3 O "3 O 
^« OJZ 


Q O £ 

. oo b 


TOO 

. /j9 




1 J 


• ^4 1 


. JOl 


1 Q 1 7 
1 • 91 / 


C >l C 

.545 


. /92 




1 J 


• ^40 


"3 C O 


O C7 ^ 

z • 5 /o 


. / OO 


1 . 009 


D i 


1 J 


oca 
• 7b 0 


. 4 y o 


7 C 7 

• /b / 


•7 "2 C 

. / jb 


i^c 
. /ob 


DZ 


1 J 


• 704 


*:> "3 >i 
. J j4 


1*010 


o c o 

. 7DZ 


•,919 


D J 


14 


Q ^5 1 
• 7^1 


O Q it 
. Z 74 


"3 ^ O 


7Q yi 

. /74 


. /d5 




1 A 
14 


• 74 D 


( J 1 5 


b . 007 


.909 


*7 yi 1 

.741 




14 




• 4 4 1 


1 c o 
1* j5o 


1 

. /5J9 


e 619 


D O 


1 A 
14 


/ 7 4 


1 P 
. O 1 o 


D . 70 / 


1 a yi o 
1 . 04o 


1.249 




1 A 
1 4 


• O / / 


0 *5 01 
. 5Z0 


4 . Wl b 


. bl 1 


•7 ft ft 

.799 


DO 




7/ 
• 0 / s 


O "3 
. Z!7 J 


C ACT 

5.05/ 


1.00b 


.521 






• * .w^ D 




7 ^P *^ 


1 01/ Q 
1 . 10 *s 7 


1 1 yi 
1 . 1 D 4 


70 


15 


.808 


.436 


2. 615 


.552 


.719 


71 


15 


.910 


.470 


1. 451 


.782 


.804 


72 


15 


.852 
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1.152 


73 


12 


.920 
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Table 4.4 



Levels of Absolute Values of Standardized Residuals 
for the Ttiree Logistic Test Models 



Number/Percentage of Absolute Values of St. Residuals 
Model 0 to 1 1 to 2 2 to 3 over 3 Ibtal 















1— fa r aiue ue r 


ICQ /A^ ^ 




iWl/ll* 


217/24% 


900/100% 


2-Parameter 


528/59* 


277/31% 


67/ 7% 


28/ 3% 


900/100% 


3^arameter 


578/64% 


258/29% 


55/ 6% 


9/ 1% 


900/100% 


0 nnunirations 












l-Parameter 


219/29% 


191/26% 


82/11% 


258/34% 


750/100% 


2-Paraineter 


477/64% 


213/28% 


46/ 6% 


14/ 2% 


750/im 


3-Parani2ter 


495/66% 


205/27% 


41/ 6% 


11/ 1% 


750/100% 


Normal Dist. 


6R% 


27% 


4% 


.25% 





the residuals for tha two- and three-parameter models ap- 
proximate normal distributions with slightly fewer small 
residuals and slightly more large residuals than normal. 
The residuals for the one-parameter model are obviously not 
normally distributed and tend toward rectar*qular distribu- 
tions having many moie large and fewer small residuals than 
would be expected for a normal distribution. 

The average standardized residuals and the average 
absolute values of standardized residuals are shown in Table 
4.5. These values show the relationship between amount of 
misfit for the three models and ability level of the exami- 
nees. The numbers ot ability intervals are different for 
math and communications because of the different distribu- 
tion shapes for the two sets of scores. When the communica- 
tions data were analyzed with 12 intervals ;^s for math^ an 
empty ability interval resulted because of the particular 
raw to log ability score conversion. The number of students 
in each of the intervals is al«?o shown 'n Table 4.5. 

The misfit of the one-parimeter model is larger than 
that of tha two- and three-parameter models throughout the 
ability scale. The only exception occurs for a cell in 
which the average residual £or the one parameter model was 
based on oily one student. 
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Table 4.5 

Mean Standardized Residuals at TV^lve Ability Levels 



for the Ttiree Logistic Models 



Number Average kw. Absolute Valuec 

of Standardized Standardized 

Ability Examinees Residuals Residuals 

Level 1-P 2-P 3-P 1-P 2-P 3-P 1-P 2-P 3-P 



Mathematics 



-2.75 


9 


6 


16 


1.879 


.394 


.117 


2.359 


1.198 


1.048 


-2.2j 


45 


27 


50 


2.948 


.433 


.170 


3.595 


1.638 


.973 


-1.75 


137 


131 


111 


3.200 


.430 


.168 


3.547 


1.139 


.935 


-1.25 


323 


325 


289 


2.859 


.608 


.420 


3.036 


.998 


.860 


-.75 


454 


482 


467 


1.064 


.145 


.280 


1.619 


1.034 


.861 


-.25 


541 


556 


571 


-.933 


-.306 


-.114 


1.714 


1.073 


.986 


.25 


491 


494 


533 


-2. 088 


-.202 


-. 256 


2.596 


.995 


.878 


.75 


477 


432 


462 


-2.662 


-.060 


-.069 


3.249 


.910 


.822 


1.25 


291 


254 


231 


-1.659 


.275 


.192 


2.420 


.934 


.957 


1.75 


124 


133 


134 


-.518 


.419 


.402 


1.434 


.763 


.818 


2.25 


74 


68 


66 


.071 


.385 


.450 


1.054 


.996 


.898 


2.75 


1 


21 


19 


-.032 


.248 


.176 


.233 


.657 


.694 



Overall 

Average — — .344 .231 .161 2.238 L.028 .894 

N 2967 2929 2949 _ — — — 



Ccaimunicatic . 



-3.00 


18 


20 


17 


2.166 


.339 


.103 


2.903 


1.361 


.960 


-2.50 


51 




43 


2.297 


.405 


.117 


3.401 


.918 


.871 


-2,00 


104 


97 


93 


1.754 


.291 


.144 


3.249 


.724 


.677 


-1.50 


147 


166 


150 


.415 


.091 


.036 


2.406 


.707 


.685 


-1.00 


296 


283 


280 


-. 534 


.086 


.077 


2.049 


.836 


.776 


-. 50 


464 


492 


484 


-1.791 


-.003 


.009 


2.657 


.768 


.772 


0.00 


401 


557 


556 


-2.942 


-.050 


.076 


3.258 


.661 


.650 


.50 


602 


487 


505 


-3.905 


.126 


.083 


4.344 


.941 


.863 


1.00 


349 


341 


327 


-1.960 


.434 


.420 


2.760 


.967 


.943 


1.50 


336 


193 


186 


-,421 


.629 


.587 


1.86: 


1.418 


1.464 


Overall 




















Average 








.492 


.235 


.150 


2.889 


.930 


.866 


N 


2768 


2668 


2641 
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The average standardized residuals, as contrasted to 
the average absolute values, may be viewed as indicators of 
bias in response estimation. If residuals, or errors, in 
response estimation are randomly distributed around zero we 
would expect the mean of the standardized residuals to be 
near zero for each ability level. The overall mean resi- 
duals for the one-parameter model are small, .344 and .492 
for math and communications respectively. However, these 
low means are caused by a consistent pattern of positive 
residuals for low scorers and negative residuals for high 
scorers. This pattern is probably caused by items being 
guessed correctly by students at lower achievement levels. 
For the two- and three-parameter models there was a tendency 
toward negative residuals near the zero log ability level 
and positive ones near the low and high extremes. The same 
pattern was present in the Maryland Functional Reading Test 
data (Hambleton, Murray, and Williams, 1983), but was not 
discussed in that paoer. The pattern would seem to result 
from overestimates of ability in the middle of the range. 
Further research is needed to explain this pattern of resi- 
duals. 

The average absolute values of standardized residuals 
show a consistent pattern of larger values for the one- 
parameter models with slightly smaller values for the three- 
than for the two-parameter model. This pattern is generally 
consistent throughout the ability range and for both math 
and communications tests. 

All three of the item response theory models used in 
this part of the study assume unidimensionality. This as- 
sumption is suspect when the model is applied to tests such 
as the SSAT-II. These are minimum competency tests, each 
based on 15 specific skills. If one or more of these skills 
measures a trait which is substantially different from the 
dominant trait of the total test, the difference might be 
reflected in misfit for those skills. In order to identify 
such misfit, the residuals for the three models were aggre- 
gated by skills. These mean residuals by skills are shown 
in Table 4.6 along with the corresponding mean p-values and 
point-biserial discrimination indices. 

The data in Table 4.6 show a general decrease in mean 
residuals as one goes from the one- to the two- to the 
three-parameter mciels. None of the mean residuals for the 
two- or three-parameter models approaches the value 2.00 
which might be concidered significant. On the other hand, 
the mean one-parameter residuals of several of the skills in 
both mathematics and communications exceed this val ue. The 
mean residuals for the one-parameter are also considerably 
more variable than those of the two- and three-parameter 
models. 
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Table 4.6 

Mean Standardized Residuals by Skill 



Mean Standard! zed Residual 

Mean 

Numbe r 

of One Two Three 

Skill Items P-Value Pt.Biserial Parameter Parameter Parameter 



Mathematics 



Ml 


5 


. 941 


. 245 


1.412 


• 862 


. 809 


M2 


5 


. 756 


. 370 


1. 948 


.870 


.732 


M3 


5 


.8jL 


. 306 


1. 495 


1 n A 

1. 043 


.888 


M4 


5 


.816 


. 396 


2. 602 


1 n A c 

1. 045 


1.040 


M5 


5 


. 890 


. 387 


1. 538 


• 984 


.894 


M6 


5 


.931 


. 375 


1. 179 


1.073 


.836 


M7 


5 


. 807 


. 409 


1 1/1/1 
1. 700 


1 • 01o 


1.014 


M8 


5 


.885 


. 333 


2m 107 


c\ ^ c\ 
• 969 


Q A 1 
. O 4 / 


M9 


5 


. 770 


Ann 

• 400 


o n A f\ 

3. 049 


1 AAA 

1« 090 


1. 000 


M10 


5 


.845 


^ A O 




Q Vl 1 
• OHX 


Q ^ Q 
. O J D 


Mil 


5 


. 786 


• 40b 


1 TIC 
X . /ID 


m X 


Q 01 01 


M12 




. 690 


TCI 

• 357 




ImV) it) 


Q 'J C 
. y J 0 


M13 


5 


. 682 


A n rr 

. 405 


J. 107 




o o c 

. O ^ 0 


M14 


5 


. 655 


TOT 

.387 


3.871 


1. 616 


1. 004 


M15 


5 


. 732 


.475 


1. 713 


1.393 


.847 


Communicat ions 










CI 


5 


.870 


. 421 


2. 862 


.828 


.782 


C2 


5 


.955 


.356 


1.881 


.749 


.838 


C3 


5 


. 920 


.451 


1. 405 


.890 


.829 


C4 


5 


. 922 


.436 


1. 970 


.927 


.873 


C5 


5 


. 902 


.452 


1. 852 


.836 


.843 


C6 


5 


.878 


.444 


1.952 


1.116 


.973 


C7 


5 


.868 


. 476 


1. 904 


1.316 


.855 


C8 


5 


. 958 


.268 


3. 473 


1.028 


.804 


C9 


5 


. 897 


. 329 


4. 524 


1.072 


.944 


C10 


5 


. 904 


.355 


3.087 


.907 


.837 


Cll 


5 


. 907 


. 402 


2. 663 


.830 


.840 


C12 


5 


.849 


.307 


4.818 


.954 


1.001 


C13 


5 


. 939 


.377 


2. 122 


.771 


.849 


C14 


5 


. 901 


.338 


4.300 


.832 


.833 


C15 


5 


. 836 


. 352 


4. 520 


.894 


.890 



It appears that the misfit of the one-parameter model 
is adequately taken care of by the addition of the discrimi- 
nation parameter. Efforts to relate this improvement in fit 
to tho mean p-value and discrimination indices were not 
productive except for the tendency of one-parameter misfit 
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to be associated with the more difficult skills. 



The subject matter of the skills and the texts of their 
constituent items were examined for association with im- 
provements in fit from the two- to * he three-parameter 
model. No obvious anomalies in item form^ subject matter 
content^ or placement of the items within the test were 
observed. 

The associations between difficulties of the items and 
the means of the absolute values of the residuals are shown 
in Table 4.7. The items for each of the tests were arbitra- 
rily divided into Iwo groups; easier and harder. The mean 
absolute values of the standardized residuals were then 
computed for each difficulty group for each IRT model. The 
results show again the larger mean residuals for the one- 
parameter model. Furthermore^ the tendency of difficult 
items to misfit the one-parameter model is obvious in the 
larger mean residuals for the harder items than for the 
easier ones. This difference is less pronounced for the 
two-parameter model and is reversed for the three-parameter 
math results. It appears that the use of the two-parameter 
model would almost eliminate the association of difficulty 
and misfit found in the one-parameter model. That associa- 
tion would be virtually eliminated with the three-parameter 
model • 

Table 4.7 

Relationship between Item Difficulty and the Absolute-Valued 
Standardized Residuals for the SSAT-II 



Difficulty # of 1-Parameter 2-Parameter 3-Parameter 

Level Items Mean SD Mean SD Mean SD 



Mathematics 

Hard(p<.8) 27 3.22 2.48 1.08 .31 .88 .23 

Easy(p>.8) 48 1.68 .91 1.00 .23 .90 .21 



Conmunications 

Hard(p<.9) 31 3.87 1.93 .98 .28 .90 .28 

Easy(p>.9) 44 2.20 1.58 .89 .21 .84 .17 



Table 4.8 shows the relationship between item point- 
biserial correlations and the mean absolute values of stan- 
dardized residuals for items by IRT model. The chi-squared 
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Table 4.8 



Relationship Between Item Point^iserial Correlations and 
Absolute-Valued Standardized Residuals for the SSAT-II 



Item ft>int-biserial Correlations 



0 to .20 .21 to .40 above .40 

Model Residuals (N«4) (N=35) (N=36) 



Mathematics 



One- 0 to 1.0 




25% 




5.7% 


16.7% 


Parameter 1.1 to 2.0 




25% 




37.1% 


63.9% 


Model over 2.0 




50% 




57.1% 


19.4% 


Chi-squared=12.11 


df=4 




p=.016 


eta=.340 




1V»o- 0 to 1.0 




50* 




60% 


47.2% 


Parameter ].l to 2.0 




50% 




40% 


52.8% 


Model over 2.0 




00% 




00% 


00% 


Chi-squared=1.18 


df=2 




p=.550 


eta=.093 




Three- 0 to 1.0 




50% 




80% 


61.1% 


Parameter 1.1 to 2.0 




50% 




20% 


38.9% 


Model over 2.0 




00% 




00% 


00% 


Chi-squared=3.72 


df=2 




p=.156 


eta=.107 





Communications 



One- 0 to 1.0 




0. 0% 




0.0% 


25.0% 


Parameter 1.1 to 2.0 




7.7% 




23.3% 


59.4% 


Model over 2.0 




92.3% 




76.7% 


15. 6% 


Chi-squared=35. 13 


df 


=4 


p=.000 


eta=.668 




TVio- 0 to 1.0 




46.2% 




70.0% 


65.6% 


Parameter l.l to 2.0 




53.8% 




30.0% 


34.4% 


Model over 2.0 




0.0% 




0.0% 


0. 0% 


Chi-squared=2.30 


df= 


=2 


p=. 320 


eta=.175 




Three- 0 to 1.0 




69.2% 




66.7% 


81.3% 


Parameter l.l to 2.0 




30.8% 




33.3% 


18.8% 


Model over 2.0 




0.0% 




0.0% 


0. 0% 


Chi-squared=1.82 


df= 


=2 


p*.403 


eta=.156 
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t show a statistically significant relationship for the 

one-parameter model and insignificant relationships for the 
two- and three-parameter models. The eta correlation coef- 
ficients show that the relationship is substantial for the 
one-parameter model and trivial for the two- and three- 
parameter models. The relationship between item discri- 
mination and fit is further shown in Figure 1. Two things 
are obvious in this figure. First, the negative relation- 
ship between discrimination and fit for the one-parameter 
model is clearly shown. Second^ the absence of such a 
relationship for the two- and three-parameter model is ap- 
parent. The inclusion of the a or "discrimination" parameter 
in the model effectively eliminates the misfit caused by 
differing item discriminations or slopes of item charact- 
eristic curves. The amount of misfit^ but not the relation- 
ship, is reduced slightly further by the addition of the "c" 
or "guessing" parameter. 

Conclusions 

The item response data from the mathematics and commun- 
ications parts of the SSAT-II fit the two- and three- 
parameter better than the one-parameter IRT model. The 
misfit of the one-parameter model appears to be related tc 
the difficult/ of the items and to varying discrimination 
indices. Most or the misfit can be eliminated by the addi- 
tion of a discrimination parameter. The addition of a 
guessing parameter reducew ^sfit slightly, but the reduc- 
tion may be so small that the consequences would be practi- 
cally insignificant 

The three-parameter model used in this study was in 
fact a modified three-parameter model in which fixed c or 
"guessing" parameteis were supplied to the "LOGIST" parame- 
ter estimation program. The estimation of c parameters 
proved to be possible for only small numbers of these items. 
Therefore, the choice of models, from the standpoint of fit, 
would be between the two-parameter and the modified three- 
parameter models. 

The failure of the three-parameter model to improve fit 
significantly over the two-parameter was probably a conse- 
quence of the easiness of the items. Guessing would be 
minimal with such easy items. Therefore, the two-parameter 
model can be recommended for the current SSAT-II test. If 
the test were made substantially more difficult, then the 
use of the c parameter m ight improve fit significantly and 
should be invest iga ted further. 

The results of this part of the study are clear. The 
use of the two- or three-parameter model would significantly 
improve fit over the one-parameter model. However, the 
pract ical consequences of the misfit are not precisely 
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Figure 1 . Relationship Between Item Discrimination and Fit 
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known. The standardization of residuals takes into account 
the sampling error of the estimated proportion correct; i. 
e., when the number of subjects is small the residual is 
divided by a larger number to arrive at the standardized 
residual. However^ division by the standard errors also 
results in larger numbers, making small raw residuals appear 
large, especially when the items are Vc^ry easy and the 
estimated proportions correct are near 1.00. Therefore, 
although the results clearly show better fit for the two- 
and three-parameter models, those results should be kept in 
perspective with other findings showing the effects of 
misfit on students' scores and the relative utility and 
complexity of the models. 
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CHAPTER 5 



ESTIMATING C PARAMETERS AND 
CONSIDERATION OF THE APPROPRIATE MODEL 



Problems in Estimating the c Parameter 

The estimation of c parameters in the three-parameter 
item response theory model is an important issue. It has 
long been know that the accurate estimation of these 
parameters is a difficult problem, and this is especially 
true for easy items. For example. Lord (1968) wrote, "On 
easy items on which even the poorest examinees do better 
than chance, the value of c is poorly determined by the 
data. On such items, the estimated value of c was 
arbitraily (sic) fixed at the chance level, 0,20. •••Values 
of c below 0^20 may arise because examinees prefer a wrong 
distracter in an item to a right answer" (p^ 1011)^ Later 
ill the same article. Lord wrote, "It was foundr however, 
that many or most values of cj are rather poorly determined 
by the data^ • • • In the c 3 of the easier items, on which 
even examinees with low \ rbal scores performed above the 
chance level of ^20, cj was arbitrarily fixed at .20^ The 
errors in this procedure are surely much smaller for such 
items than the sampling errors in the maximum likelihood 
estimates, which might turn out to have some unreasonable 
value such as cj= ^60 or 0^= -•10^" (p^ 1014)^ 

The above quotations are from the first published 
report of use of the three-parameter model on a published 
test^ Eight years later. Wood et al^ (19X6, p^ 18) wrote 
that Lord had shown in 1975 that "the cj*s are poorly 
determined when (bj - 2/aj) is £ -2^" After further study 
and use of the model, estimation of the c parameters was 
still a problem^ Tg quote Lord again, he more recently 
(1980) stated, "• • •bj is unstable wh')n Pj or Qj is smalls 
The situation is much worse when Lo*"h a^ and bj must be 
estimated simul taneously. If only a few examinees answer 
item i incorrectly, it is obviously impossible to estimate 
aj with any accuracy" (p, 185) • He went on to say, "The 
problem is even more obvious for Cj, which represents the 
performance of low-ability examinees. If we have no such 
examinees in our sample, any reasonable value of Cj will be 
able to fit the data about as well as any other. If we 
arbitrarily assign some plausible value of Cj and then 
estimate a ^ and bj accordingly, we shall obtain a good 
description of our data" (p, 186), 

Several years later, Hulin et al, (1983) summarize a 
study by Ree done in 1979 by stating, "Clearly, even long 
tests and large samples do not necessarily allow accurate 
estimation of c" (p« 100) • 



33 

37 



Hutten (1980) explored the question of whether c 
parameters would even be estimated by LOGIST A, the most 
widely used computer program for estimating the parameters 
using the three-parameter model. As we shall see^ that 
program does not make estimates of c parameters for 
individual items when it seems that those estimates might be 
subject to large errors or lack of stability. In those 
casesy the program gives the item a sort of average c value^ 
and many unestimated items may be given that same value for 
the c parameter. Using twenty-five data sets of 1000 
examinees on each of a number of widely used standardized 
testSy such as the Stanford Achievement Series^ the 
Comprehensive Tests of Basic Skills^ the Iowa Tests of Basic 
Skills^ the California Achievement Tests^ and the Scholastic 
Aptitude Test of the College Boards Hutten concluded that 
"Although there are many indicators that ICC's are 
characterized by non-zero (sic) lower asymptotes^ only ono- 
third of guessing parameters were estimable by the LOGIST 
program for samples of 1000 examinees" (p. 30). She 
suggested that . .it may be more practical to set the 
lower asymptote to some reasonable value without attempting 
more refined estimation" (p. 30). 

Thissen and Wainer (1981) studied the standard errors 
that would be obtained for estimates of item parameters in 
the three-parameter model if unrestricted maximum-likelihood 
estimation procedures were used. They concluded that ". . . 
the use of an unrestricted maximum likelihood estimation for 
the three-parameter model either yields results too inexact 
to be of any practical use^ or requires samples of such 
enormous size so as to make them prohibitively expensive. 
This problem arises for items that are easier than average. 
This effect is the result of the huge covariance. . . 
between location and lower asymptote. When an item is 
relatively easy there are few observations available to 
estimate the lower asymptote thus making its standard error 
very large. The large covariance between lower asymptote 
and location then causes this uncertainty to move partially 
to the estimate of location. • . . The two-parameter model 
has problems as well^ but they are far less severe" (p. 7). 

They go on to say^ "The estimates [of location] are 
hopeless for easy items. This is fortunate^ for it is on 
items like this where the guessing parameter is not 
required. This provides hope that a hybrid model that 
computes a lower asymptote for difficult items and doesn't 
for easy items may be useful when the assumption of no 
guessing does not seem to be plausible" (p. 9). The 
situation is not entirely hopeless^ they point out^ since 
even with these large standard errors^ using the entire ICC 
to reproduce the data can be useful. However^ the problem 
is most severe when the location parameters are to be used 
separately^ such as their use in test equating/linking or in 
computerized adaptive testing. To display the magnitude of 
the problem/ they point out that if the lower asymptote is 
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not homogeneous, i*e*, the c values for all the items are 
not the same, accurate easiness values for easy items are 
obtainable only with Ns of 100,000 or so* 

As Thissen and Wainer point out, one solution to the 
problems of the large standard errors of unrestrained 
maximum-likelihood estimation is to abandon the unrestrained 
approach. A Bayesian scheme would be appropriate, with 
prior estimates of the parameter values constraining the 
posterior estimates. This is essentially what the computer 
program LOGIST 4 does. De Gruijter (1984) points out that 
this, in itself, may result in biased estimates, but that 
estimates are biased if the lower asymptote is fixed, also. 
However, De Gruijter concludes that "... it is clear that 
reasonable parameter estimates for the three-parameter model 
are possible, using prior and collateral information on the 
lower asymptote" (p. 272). 

Wood et al. (1976) describe the constraints in LOGIST 4 
as follows: 

To avoid wild fluctuations in the 
oaramater e;>timates, the amount that each 
parameter can change is restricted as 
follows: ... [the absolute value of the 
change in Cj is equal to or less than] 
CLAMBDA, where CLAMBDA is read in or set 
to the defaulj: value of .06. ... Since 
many of the c's are poorly determined, 
any movement is severely restricted. The 
restrictions are based on ^n 
approximation to the standard error of c. 
. . ., on an approximate median value for 
all the c denoted by C, on the proportion 
correct adjusted ^or omits . . ., on the 
value of (bj - 2/a \) , on a mii;imum c-^^ 
>0, and on a maximum Cj^^j^ <• 5. The 
rest;^^r ict ions^ on the range deffhe bounds 
on cj that cj may exceed only if its 
standard error is small. The 
approximation to the standard error is 
the square root of the diagonal element 
for cj in the inverse of Tj. 

An absolute minimum of 0 and maximum of 
.5 is placed onjc^. In Lord (1975) it is 
shown that the Ci''s are poorly determined 
when (bi - is jC -2. The Cj»s are 

better determined when -2 < (bi-2/aj) < - 
1, and the cj's should be well dete^^mined 
when ibj-2/ai) > -1. Therefore if c < pj^ 
theQ C{ is automatically held fixed at c 
if bj-2/ai becomes less than FEPS (-2, or 
a user supplied value) (pp. 16-18). 
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The c in that quotation is described several pages 
later in the same document as follows: 



A good value for c^^^ is 1/A (number of 
choices per item)^ for c is Cj^^j. - .05# 
and Cj^jp should be the san'^ distance 
below CT as Cj^-^j. is above c. In each 
stage where all of the items are 
estimated/ the program computes the 
median of the values (for the fir^st 
iteration of each item) to which c's 
currently allowed to vary would move if 
they were completely unrestricted. If 
the new median differs from c by more 
than ± .005/ c is changed to the new 
median and c^^^ is adjusted to keep Cj^^^ 
the same distance below the new c as Cj^-^^^ 
is above the new c. (p. 20) 

The criterion (b-2/a) is called the stability criterion^ and 
it is describee" in the user's guide for a later version of 
LOGIST (Wingersky^ Barton^ and Lord/ 1982) as follows: 

The stability criterion (b-2/a) is the 
ability level at which the proportion of 
correct responses is only about .03 above 
the lower asymptote Cc"); if there are 
few examinees with ability estimates 
below this levels then a stable estimate 
of "c" cannot be obtained. In other 
wordS/ we cannot estimate well the "c" 
paran>oter for very easy items/ or for 
moderately easy items with low "a" 
parameters. . . .A value of -3.5 for 
CRITFIXC is suggested for samples of 2000 
to 3000 examinees; a value of -2.5 is 
reasonable for smaller samples, 
(pp. 20-21) 

Nature of the SSAT ::i Tests 

The tests of SSAT II are very easy tests. The 
distrioution of scores for both the mathematics test and the 
communications test are highly negatively skewed/ as can be 
seen in Figure 5.1. The skew index for communications was 
-2.4/ and the index for mathematics was «»1«1. This suggests 
that we will have difficulty estimating c values for these 
tests by means of LOGIST 4. Supporting that premonition is 
Eignor*s experience (personal communication/ Summer/ 1984) 
with the New Jersey minimum competency tests. When an 
attempt was made to use LOGIST 4 to estimate item parameters 
for the New Jersey testS/ severe convergence problems were 
encountered. Even fixing c parameters/ or setting them to 
zero, did not alleviate the problem. If the program does 
estimate c values/ it may simply give most or all of the 
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items the c value described ab-^ve. 



Several authors have suggested that one might 
oversamp]e at the low end of the distribution in order to 
have large numbers of examinees with low scores and thereby 
sufficient data to estimate ^ith some degree of accuracy the 
c parameter^ the probability of someone of very lew ability 
answering -lorrectly^ Lord (1968) f^uggested this^ as did 
Thiss**n and Warner (1981). 

Another possibility that might be explored is to reduce 
the critical value of c below which c is used by the program 
instead of an estimate of c for the individual item. In 
fact, for LOGIST 5, Wingersky et al. (19B2) suggest setting 
CRITFIXC at -3.5 for samples of 2,000 to 3,000 instead of 
using the value of -2 recommended tor LOGIST 4. 

One more possibility exists for obtaining about the 
same effect as lowering the critical value of c. Thai: is to 
standardize on b parameters instead of on thetas. The usual 
procedure is to standardize on thetas, since the scale is 
arbitrary. However, one has the option of standardizing on 
b parameters. Since these items are very easy for the 
examinees, standardizing on thetas results in a negative 
average b parameter. If the average b parameter is already 
-1, not many of the easier items will fail to meet the 
critical value of (b-2/a), and thus few individual cs will 
be estimated. If we standardize on bs, the average b value 
is automatically zero, and more items might have estimated 
cs above the critical value. 

Of course, the problem with these approaches is that if 
we simply arrange to have more cjs estimated but the 
estimates are unstable, as Lord (1975) has indicated, we may 
not have profitted much from the exercise. Still, as far as 
we know, explorations of these issues have not been reported 
in he literature, so their study here has not only 
practical importance for SSAT II analyses of the future, but 
also has a more general significance for others wno might 
attempt to employ the three -parameter model using LOGIST 4 
as the estimating program for the item parameters. 

Therefore, this study was aimed at determining how 
effective the various strategies of: 

1) oversampling at the low end of the distribution, 

2) using a lower critical value of c, i.e.,(b-2/a < -2), and 

3) standardizing on b rather than theta, 

singly and in combination, would be at causing more cjs to 
be estimated and fewer CTs to be reported. 

lr< addition to attempting to obtain more c^s, we w^^nted 
to evaluate how well the f] , b, and c values agreed from one 
analysis to another. One might be distressed if he found 
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that using one or more of these techniques resulted in a 
much higher proportion of estimates of c values for 
individual items^ but the corresponding a values did not 
agree with a values from other procedures^ or corresponding 
b values did not agree with b values using other procedures. 
Furthermore^ it is also possible simply to fix the c values 
at zero^ or at some value that more reasonably reflects the 
effects of guessing and other extraneous variables^ such as 
1/A - .05. The former would be a two-parameter model, with 
a parameters reflecting item quality and b parameters 
reflecting item difficulty^ and an assumption that no 
guessing existed — that a person of very low ability had zero 
probability of getting an item correct. The latter might be 
called a modified three-parameter models indicating that 
three parameters are used^ but the third is not estimated 
for individual items but is set at a common fixed value for 
all items based on the number of alternatives. 

Data 

Data from the March^ 1984^ administration of SSAT II 
were used in these analyses. A random sample of 3000 cases 
was taken from a systematic sample of 9000 cases from the 
total population of 94^261 usable cases. Another sample was 
drawn for each test from the 9000 case sample by eliminating 
the lowest-scoring 50 cases (assuming that their scores are 
so low as to represent fai lures in test administration 
rather than sound estimates of ability)^ and then taking the 
lowest-scoring thousand cases and a random sample of 2000 
from the rest of the distribution. This procedure was 
believed to provide samples heavily over represent ing the 
examinees of low ability on each measure^ mathematics and 
communication. 

Analyses 

For the representative sample^ LOGIST 4 was used to 
estimate item parameters for each test^ communications and 
mathematics. Firsts the standard options on the program 
were used. These options standardize on thetas and set the 
criterion for (b-2/a) at -2. Then the same analysis was 
done standardizing on b parameters. Next, the same analyses 
were done setting the cr i terion for (b-2/a) at -3 and, 
again, at -4. For thesjf analyses, it was observed that the 
limit on the ma;(:imum a parameter of 2 was inhibiting. A 
large number of a parameter values were fi^ed at the maximum 
allowed (2.0). Therefore, the maximum on a was raised to 3. 
This same set of analyses was repeated using the data which 
overrepresented the low scorers. 

The results of these analyses were evaluated to 
determine which of the above procedures, or combinations of 
them, seemed most effective at yielding increased numbers of 
estimates of Cj for individual items. The most effective 
procedure was used, then, to provide input for new LCXSIST 4 
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analyses of the same data for comparisons with the results 
obtained from fixing c values at 0.20, (1.0/A •05) 
representing chance, and from fixing c values at zero. 

The results from tt^e various procedures used to 
increase the number of c s indicated that for the 
communi cat ions test, no arrangement resulted in estimating 
more than nine of the c parameters. Apparently that test is 
so easy for the examinees that the population con ta ins 
insufficient members of low enough ability to estimate the 
probability that a person of very low ability will answer 
these items correctly. 

For the mathematics test, a number of c parameters 
were estimated using these devices. The number is still 
fewer than half (only about one-third), so the results ^re 
not entirely satisfying. Changing the standard for (b - 
2/a) to -4, by itself, was about as effective as -3, but 
better than -2* Using a ^criterion of -2 resulted in 
estimates of 12 individual c parameters, o^ 16%, but using 
-3 resulted in estimates^of 16 individual c parameters, or 
21%. Standardizing on bs instead of thetas was about as 
effective as changing the standard to -3 or -4. Using the 
standard of -4 and weighting the low ability end of the 
sample by taking 1000 lowest scoring students and a randT 
sample of 2000 from those scoring above these 1000 cases 
resulted in the largest number of c parameters being 
estimated. So the c parameters from using a standard of -4 
and the nonrepresentat ive sample were input as fixed values 
to LOGIST 4 as one of the procedures to be compared. Since 
we were unable to find a way to estimate a useful number of 
c parameters for the communications test, it was not further 
studied . 

To compare ICC parameters resulting from fixing cs at 
different values, we contr^sted results from the following: 

A. Use c parameters based on -4 as the criterion for 

(b - 2/a) a id ^ nonreprosenta ti ve sample as 
fixed values for a LOGIST run--a threes- 
parameter model. 

B. Fix c pa-'ameters at (1.0/NCH - .05), i. e., at 

essentially a chance value—a modified three- 
parameter model. 

C. Fix c parameters at zero, i. e., as they would be 

fixed in a two-parameter model. 

Results 

As car be seen in Table 1, the mean a values and mean b 
values and their standard deviations were very similar when 
the c parameters were fixed at values obtained from LOGIST 
and when c parameters were set at chance level {.2). The 
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average a and b values and their standard deviations setting 
c equal to zero were noticeably different. 



Table 5.1 

Means and Standard Deviations of a and b Parameters 
W hen Fixed c is Injput from LOGISTy 
Set at .2, and Set at Zero 



Log is t 

Estimated c c set at .2 c set at 0 



Mean .915 -1.385 .912 -1.370 .831 -1.578 
S.D. .325 1.201 .324 1.190 .302 1.056 



The correlations between estimates of a parameters 
using c parameters fixed in three different ways can be seen 
in Table 5.2 where "Input c" is obtained by fixing c at the 
level estimated by a previous run of LOGIST 4 setting the b- 
2/a criterion at -4 and oversampling low scoring examinees. 

Table 5.2 

Correlations between Parameter Estimates Using Different 
Procedures for Fixing c Parameters 



1 2 3 

1. Input c .97 .82 

2. c = .2 .97 .87 

3. c = 0.0 .82 .87 



The correlation bet /een a parameters when c is set at 
0.20 and when c is input from LOGIST 4 estimates is very 
high, .97^ but neither of these procedures for estimating c 
parameters results in a parameters which are highly 
correlated with those obtained from setting c equal to zero. 

Frequency distributions of the differences between a 
values based on the three procedures for determining c 
parameters, and frequency distributions of the differences 
between b parameters based on the three procedures appear in 
Tables 5.3 and 5.4. These data clearly support the 
conclusion that the results from the three-parameter model 
and the modified three parameter model are quite similar. 
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and they are both quite different from results from the two- 
parameter model. 

Table 5.3 

Frequency Distributions of Differences Between a Parameters 
Based on Three Procedures for Fixing c Parameters 
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Midpoint 
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Table 5.4 

Frequency Distributions of Differences Between b Parameters 
Bas p Three Procedures for Fixing c Parameters 



Interval 



Midpoint 
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Conclusion 



From these data and analyses, it seems apparent that 
the c parameters obtained from setting the standard of b-2/a 
at -4 while using a nonrepresenta t ive sample weighted at 
the low end and from fixing the c parameters at chance level 
are very similar, but the results from fixing c parameters 
at zero differ appreciably from the other two procedures. 
Since fixing the c parameters at zero flies in the face of 
the fact that the items are multiple-choice items in which 
it is unreasonable to expect that a person of very low 
ability would have probability of zero of answering 
correctly, the most sensible recommendation appears to be 
the simple one of setting the c parameters at a value such 
as 1/A-.05 for all the items and relieving LOGIST 4 from 
having to estimate the c parameters at all. This 
recommendation is sound for the SSAT II mathematics test 
based on the above analyses; it is sound for SSAT II 
communications because no procedure was found that enabled 
LOGIST 4 to estimate c parameters for a significant number 
of items. 

It should not be surprising that using 1/A-.05 as 
estimates for c values corresponds fairly well with our best 
estimates of c parameters by means of the LOGIST 4 program. 
LOGIST 4 at best estimated c parameters for only about 1/3 
of the mathematics items, and it set the c parameters to a 
common value for the rest. The common value was not far 
from .20. So the c values for most of the items were not 
far from the c values one would set based solely on 
knowledge of the number of alternative responses. 

The program LOGIST 5 includes among its improvements 
over LOGIST 4 modifications in the procedure for estimating 
c parameters. The common c (c) is estimated by maximum 
likelihood procedures; the only restrictions on c are bounds 
of 0.0 and 0.5, and the fixing of c at a common value if b- 
2/a is less than a critical value which has been set; and, 
the c parameters which have been set at a common value due 
to violation of b-2/a are re-estimated in the last step of 
the program. Whether these modifications in LOGIST would 
decrease the problems of estimating c values for minimum 
competency tests to an appreciable degree deserves 
exploration. The program BILOG uses a Bayesian procedure to 
estimate item parameters. Bayesian procedures set 
constraints on item parameters in terms of prior probability 
distributions Instead of the less formal constraints set by 
LOGIST 4. It is possible that a Bayesian approach would 
result in better estimates of c parameters in minimum 
competency data. This possibility also deserves 
exploration. 
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CHAPTER 6 



DETERMINING APPROPRIATE SAMPLE SIZE AND 
DISTRIBUTION SHAPE 

The Significance of Sample Size and Distribution Shape 

The number of examinees which must be used to obtain 
satisfactory estimates of item parameters is a significant 
consideration in determining whether it would be wise for 
the State of Florida to use one model or another of the 
family of item response theory models. The fewer parameters 
that are estimated^ the smaller the number of examinees 
needed for stable estimates of parameters. Also^ some 
parameters are easier to estimate stably than others. Item 
difficulty seems to be rather easy to estimate satis- 
factorily^ but the guessing parameter seems to be quite 
recalcitrant. Since the current procedures are to use 
pretest samples of around 500 cases^ and that does not seem 
to be onerous^ a relevant question is whether the use of a 
more complex model than the one-parameter model would be 
feasible with samples of about that size. If not, an issue 
in adopting a more complex model would be the feasibility of 
obtaining larger pretest samples on which to obtain item 
parameter estimates. 

The usual minir.ium sample sizes stated for the three- 
parameter model are 1000 examinees and 40 items. Some 
computer programs require more^ but these numbers are those 
usually suggested for LOGIST 4. However^ if one were to 
consider a modified three-parameter model in which the c 
parameters were fixed at a specified value, such as chance, 
the program would have one of its troublesome estimation 
problems removed as well as having a smaller number of 
parameters to estimate. As a result fewer examinees might 
be needed. If one were to consider a two-parameter model, 
with all c parameters set at zero, the program might also 
require fewer cases. Experience with these data seems to 
suggest that estimation of c parameters is impossible for 
the communications test data and largely unproductive for 
the mathematics data (only about 1/3 of c parameters 
estimated at best). Therefore, it is important to see what 
savings in numbers of examinees are feasible by eliminating 
estimation n parameter, and whether it would be 

feasible to ute pretest samples of about the size now being 
used (500) when estimating parameters using LOGIST 4 and a 
modified three-parameter model or a two-parameter model. 

Another way to attempt to obtain stable estimates of 
parameters is to oversample the extremes of the population 
distribution so that more data are available for evaluation 
of the asymptotes of the logistic curve. In a sense, the 
problem of estimating c parameters is that there aren't 
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enough low-scoring examinees to fix accurately the shape of 
the logistic function at the lower end. Lord (1968)^ for 
example^ says that an item characteristic curve can be 
considered as the regression of item score on ability^ and 
the problem of estimating aj and bi can be considered as a 
problem of estimating a regression. He says^ "It is well 
known that data having a normal or bell-shaped distribution 
of the independent variable are not nearly as efficient for 
estimating a regression curve as data containing many 
extreme values of the independent variable. Thus^ item 
character istic curves can probably best be estimated by 
using a group of examinees with rectangular or bimodal 
distribution of ability" (p. 1017). In his analysis of the 
SAT Verbal score he used 2^682 examinees chosen from a 
larger group of 5220, using more low-scoring than high- 
scoring examinees "since the sampling error of low scores is 
highy due to random guessing" (p. 1018). Even then^ he 
suggested that a still more nearly rectangular, or even U- 
shaped distribution, might have been preferable. 

The distributions of SSAT II scores are highly skewed, 
as has been noted earlier. The communications scores are 
exceedingly skewed, much more skewed than the mathematics 
scores. A result we have seen from this is that LOGIST 4 is 
unable to estimate c parameters for the communications 
items. The question remains, however, if we sampled from 
the available population of examinees so that the low end 
was greatly overemphasized, would we be able to obtain 
reasonably sound estimates of the item parameters using 
smaller numbers of examinees? If so, and if the State 
Department of Education can identify schools or locations in 
which large numbers of examinees at the low and high ends of 
the ability scale can be pretested conveniently, 
oversampling might be a feasible procedure for obtaining 
sound item parameters efficiently. 

Procedure 

Before any of the studies described in previous 
chapters had been done, the following procedure was planned 
for this analysis. A matrix of samples would be assembled, 
one dimension being sample size, the other being sample mix. 
Sizes such as 3000, 2000, 1500, and 1000 would be used. 
Three different mixes of abilities would be crossed with 
these sample sizes. One mix would be the abilities that 
naturally occur, a second would overemphasize the extremes 
of ability by a factor of 3, and the third would 
overemphasize the extremes of ability by a larger factor. 
It was planned that these factors might be modified 
depending on the kinds of distributions that naturally 
occur. The data would be analyzed in each cell using LOGIST 
4 in an effort to locate the optimum sample size and mix for 
future analyses of Statewide Assessment Test data. 

Experience with other phases of this project guided us 
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in revising our procedure. First, our previous analyses 
suggested that the estimation of c parameters might not be 
practical for these data. Therefore, we decided not to 
pursue the estimation of c parameters throughout all levels 
of analyses for these steps unless the initial ones were 
encouraging. Second, it appeared that it would be necessary 
to locate fairly precisely the minimum number of cases 
required to make reasonably precise and stable estimates of 
a and b parameters. The logistics involved in the currant 
trial administration of items argues for small samples. It 
is relevant to obtain information on whether the current 
sampling procedures for trial administrations can be 
followed unaltered while using a more complex model. Third, 
the extreme skew made oversampling of different degrees very 
difficult. In Figure 5.1, for instance, it can be seen that 
the lowest-scoring 1000 out of 9000 cases would include all 
the students who had answered correctly from zero to 
approximately 60 out of 75 items on the communications test 
while the highest-scoring 2000 would all have scores of 74 
or 75. To use the lowest-scoring 1500 would include some 
rather high scores in the "low-scoring" group. The other 
alternative would be to sample from a larger number than 
9000 which becomes rather cumbersome. 

As a result of these considerations we did LOGIST 
analyses for the following sets of data: 

For mathe:natic5>, using the existing distribution we 
analyzed reprf:sentat ive samples of 3000, 1000, 750, 
500, and ?5i). We used both LOGIST-es t i ma ted c 
parameters and fixed c parameters (c parameters 
fixed at .20) for the samples of 3000 and 1000, but 
fixed c parameters only for the samples of 750, 
500, and 250. We oversampled the low end, and did 
LOGIST analyses of 3000 cases with both estimated 
and fixed c parameters. For a sample of 1000 we 
did an analysis with estimated c parameters to 
verify our anticipation that few if any c 
parameters would be estimated even with 
oversampling at the low end. 

For communications, we did not believe that 
anything we could do would result in useful numbers 
of c estimates. Therefore, we did not bother with 
oversampling at all or with estimating c parameters 
at all. We merely tried to see what would happen 
to the estimates of a and b parameters as the 
sample size decreased, with c fixed, using samples 
of 3000, 1000, 500, and 250 cases. 

Results 

We compared the means, the standard deviations, and the 
intercor relations of a and b parameters using different 
methods of estimating in order to see what happened as 



46 

50 



sample size became smaller and as we oversaiDpled the lower 
end. We assumed that the analyses ^'j^inq 3000 examinees and 
estimating c parameters were the most sound, and other 
procedures would be evaluated in terms of how closely their 
results approximated those of N-3000* As a result of work 
on other aspects of this project, we had available one 
analysis on N'=3000« We drew another random sample of 3000 
from the 9000 sample so that two samples of N=3000 could be 
compared as a base line* These two samples are labeled 
N3000A c-est and N3000B c-est in Table 6.1. We also ran 
LOGIST 4 on these samples with c parameters fixed at .20. 
Those two analyses are labeled N3000A c-fix and N3000B c^ 
fix. A final sample of 3000 cases was analyzed, N3000-S, c- 
est, in which the lower end of the distribution was 
dramatically oversampled following Lord's suggestion. 

It can be seen in Table 6.1 that there is quite good 
agreement between random samples using the same method with 
Ns of 3000, and that estimating c parameters or fixing c 
parameters at ,20 makes little difference in the means and 
standard deviations of a and b parameters. However, as one 
might expect, oversampling at the low end increases the mean 
b parameter. (Since the scale is fixed so that the mean 
ability of the group is zero, and since the mean ability is 
lower with greater sampling at the low end, the difficulty 
of the items scales out at a higher value, i. e«, less 
negative.) The mean a parameter is also larger when low 
abilities are oversampled. Notice that the mean c parameter 
value is only .18 with low abilities oversampled, but, when 
c parameters are estimated, the mean c parameter is 0.24 
when a representative sample is used. A lower c parameter 
allows the item characteristic curve to follow a steeper 
shape before becoming asymptotic, and the a parameter of the 
curve is thus larger. (Consider that if the c parameter 
were some large value such as .90, no a parameter could be 
very large. All slopes would be flat — and the item could do 
little discriminating, indeed, with all that guessing going 
on>) Thus, the baseline data on samples of 3000 behaves 
appropriately. 

Now when we move to representative samples of 1000, the 
means and standard deviations of a and b parameters are much 
like those for representative samples of 3000, whether c 
parameters are estimated or fixed at .20. With oversampling 
at the low end, however, the mean a parameter is increased 
noticeably (from .9 to 1.2), and the a parameters are more 
variable (standard deviations increasing from .3 to .4.) 
More dramatically, the mean b parameters a re increased 
noticeably more than they were for tne oversampled group of 
3000. Instead of the mean b parameter of -.8 we have a mo.an 
b parameter of about -.3 with reduced variability. So 
oversampling results not only in disagreement with 
representative sampling of N=3000, but also in disagreement 
with the oversampled data for N«3000. Oversampling the low 
end wich N of 1000, thus, seems to produce results unlike 
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Table 6.1 



Means and Standard Deviations 
of the Item Parameter Estimates 



Sample 



N3e!00A c-est 


.93 


-1.32 


.24 






M OiA\ 

^ J. . ) 




N3000B c-est 


.91 


-1.39 


.24 




\ • i^i 


\ ± • ^ ^ } 


\ * ^ ) 


N3000A c_fix 


.91 


-1.37 


.20 




( 7? \ 

\ m Ot* } 




\ m UU } 


N3000B c-fix 


.89 


-1.43 


.20 






n 251 


( . 00) 


N3000-S c-est 


1.02 


-.85 


.18 






( 1. 06) 


( . 03) 


N1000 c-est 


*88 


-1.48 


.19 




i 31 \ 


(1. 29) 


(.01) 


N1000 c-fix 


.87 


-1.47 


.20 




i 31 \ 


( 1. 30) 


( . 00) 


N1000-S c-est 


1.15 


-. 30 


.18 




( .42) 


(.85) 


(.05) 


N1000-S c-fix 


1. 15 


-.27 


.20 




(.40) 


(.85) 


(.00) 


N750 c-fix 


.94 


-1.38 


.20 




(.35) 


(1.35) 


(.00) 


N500 c-fix 


.89 


-1. 52 


.20 




(.34) 


(1.60) 


(.00) 


N250 c-fix 


.90 


-1.89* 


.20 




(.43) 


(3.08) 


(.00) 



*3ased on 74 items. Item 36 deleted from calculations due 

to the extreme value estimated for the b parameter. 

Note: "c-est" means that LOCIST estimated the c parameter^ 

and 

"c-fix" means that the researcher fixed the value at a 
constant of .20. 

Note: "S" after the sample size indicates skewness 
different from parent population (oversampling of lower end 
of ability). 

Note: "A" and "B" Indicate two equal-sized random samples. 
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those obtained with larger samples and seems unattractive as 
a procedure for obtaining sound estimates of a and b 
parameters* 

When sample si ze is reduced to 7 50^ c parameters being 
fixed since they are unlikely to be estimated stably with 
such small samples, the mean a parameters are quite similar 
to those for N=3000* So are the a standard deviations, and 
the mean b parameters* The standard d viation of the b 
parameters is perhaps a little larger than for N=s3000 (1*3> 
vs. 1.19 or 1.25). For N of 500 and N of 250 the mean a 
parameters are quite similar to the sample of 3000, and the 
standard deviation differs only for N of 250, and only a 
small amount there (.43 vs. about .33). However, the mean 
of the b parameters drops to -1.5 instead of about -1.4, and 
more noticeably, the standard deviation increases from about 
1.2 to 1.6 for N=500 and 3.1 for N»250. 

Table 6.2 shows the correlations among most of these 
estimates of a parameters. The two samples of N-3000, one 
with c estimated and the other with c fixed correlate ^97, 
encouraging one to believe that one set of results could be 
substituted for the other with impunity. When represent- 
ative sampling is used, the correlations with similar 
procedures between N=1000 and N=3000 are both .91. The same 
value results when oversampling is used but other procedures 
are similar between the sample sizes. When c is fixed, the 
correlation between the results from N=3000 and N=750 is 
0.88. For N = 500, the correlation is .84, and for N = 250 it 
falls to .75. One might wonder if the relationship is 
satisfactory when N is below 750, or, at most 500. 

Table 6.3 shows similar statistics for the b 
parameters. Here the correlations remain satisfactorily 
high until N becomes as small as 250. 

Table 6.4 shows the correlations among the c parameter 
estimates for those samples in which c parameters were 
estimated. When representative sampling was used, the 
correlations among c parameters are very high (.99) for 
large samples (both N=3000). But between N of 3000 and N of 
1000, the correlations drop to .61 and .55. The 
correlations between analyses using oversampling at the low 
end for one anal ys is and representative sampling for the 
other are disappointingly low, all below .40. These results 
add to the results of other steps in suggesting that for 
these minimum competency tests, use of an estimated c 
parameter may not be very helpful. 

From these analyses we might conclude, then, that for 
the mathematics test, oversampling at the low end seems to 
result in parameter estimates that disagree markedly with 
those from taking a large representative sample. Using 
LOGIST 4 estimates of the c parameter seems unpromising for 
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Table 6.2 



Pearson Correlation Coefficients 
for Mathematics a Parameters 





(1) 


(2) 


(3) 


(4) 


(5) 


(6) 


(7) 


(8) 


(9) 


(10) 




3000 


3000 


3000 


1000 


1000 


1000 


1000 


750 


500 


250 




A 


A 


S 






S 


S 










est 


fix 


est 


est 


fix 


est 


fix 


fix 


fix 


fix 


(1) 


1. 02 




















(2) 


.97 


1.00 


















(3) 


.92 


.90 


1. 00 
















(4) 


.91 


.86 


. 91 


1.00 


















Q 1 


Q3 


96 


1.00 












(6) 


.79 


.79 


.85 


.78 


.82 


1.00 










(7) 


.74 


.76 


. 87 


.79 


.82 


.90 


1.00 








(8) 


.85 


.88 


.85 


.85 


.89 


.75 


.75 


1.00 






(9) 


.85 


.84 


. 85 


.92 


.94 


.77 


.77 


.83 


1. 00 




(10) 


.76 


.75 


.69 


.75 


.79 


.68 


.59 


.84 


.81 


1.00 



Note: All coefficients are significant (p=.001) 
Note: Only samples "A" of N=3000 are shown. 

these minimum competency test scores. For the b parameters, 
one might comfortably use representative samples of as few 
as 250, but for the a parameters he might more safaly stay 
above 750, or at least 500, examinees. 

For the communications scores, remember that we were 
unsuccessful at estimating c parameters using LOGIST 4 for 
these veiy easy items. So we did not oversample at the low 
end for these analyses, nor did we estimate c parameters. 
We merely fixed c at .20, and then did analyses on samples 
of 3000, 1000, 500, and 250 cases. The results for a and b 
parameters appear in Table 6.5. For these data the mean a 
and b parameters were near the same value for Ns of 3000, 
1000, and 500, but changed appreciably for N of 250. 
However, Table 6.6 indicates that the correlation between a 
parameter estimates becomes rather low when K decreases 
below 1000. A similar result appears in Table 6.7 for the b 
parameters • 
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Table 6.3 

Pearson Correlation Coefficients 
for Mathematics b Parameters 





(1) 


(2) 


(3) 


(4) 


(5) 


(6) 


(7) '8) (9) 


(10 




3000 


3000 


3000 


1000 


1000 


1000 


1000 750 500 


250 




A 


A 


S 






S 


S 






est 


fix 


est 


est 


fix 


est 


fix fix fix 


fix 


(1) 


1. 00 
















(2) 


.99 


1. 














(3) 


.98 


< 


0 












(4) 


.95 


,96 


.99 


1.00 










(5) 


.95 


.96 


.99 


.99 1 


. 00 








(6) 


.95 


.96 


.97 


.96 


.96 1 


. 00 






(7) 


.95 


.96 


.97 


.97 


.97 


.99 1 


. 00 




(8) 


.86 


.87 


.89 


.91 


.91 


.91 


.91 1.00 




(9) 


.95 


.94 


. 92 


.91 


.91 


.87 


.87 .77 1.00 




(10) 


.83 


.81 


.84 


.85 


.85 


.73 


.74 .68 .90 1 


. 00 



Note: All coefficients are significant (p=.001). 
Note: Only samples "A" of N=3000 are shown here. 

From these analyses, it seems that with these very easy 
minimum competency tests (skew incices below -2.2, and 
minimum item p value of .74), not only are LOGIST 4 
estimates of c not feasible, but estimates of a and b 
parametei ^ are shaky unless based on Ns of at least 1000. 



Concl usions 



We set out to determine whether by fixing c parameters 
or by oversampling at the lower end of the distribution it 
would be possible to istimale item parameters more 
accurately or with smaller numbers of cases than are needed 
with a full three -parameter model. We found slightly 
different rosjlts for mathematics and communications. For 
mathematics, oversampl.-- at the low end has a noticeable 
effect on both a and b parameters. Both a and b parameters 
increase, on the average, and a parsieters become more 
variable. The correlations between c parameters obtained 
with and without overrepresentation of low ability are quite 
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Table 6.4 



Pearson Correlation Coefficients 
for mathematics c Parameters 



N3000A N3000B 



N3000S N1000 



N1000S 



N3000 



1.00 



N3000B 


(P=i 


99 
001) 


1. 


00 






N3000S 


{p=i 


39 

001) 


{p=i 


38 

001) 


1. 


00 


N1000 




6i 

001) 


(P=! 


55 

001) 


(p=! 


27 

001) 


N1000S 


{p=l 


18 

122) 


{p=l 


18 

117) 


{p=l 


70 

001) 



1. 00 



. 09 

{p=.437) 



1. 00 



Table 6.5 



Means an^ Standard Deviations (in parentheses) of 
Communications ICC Parameter Estimates* 



Sample 



N3000 


.831 


-2.090 




(.30?) 


(.837) 


N1000 


.841 


-2.102 




(.293) 


(.776) 


N500 


.855 


-2.076 




(.313) 


(.763) 


N250 


.911 


-lo825 




(.374) 


(.825) 



*c parameter statistics not Included because c parameters 
were fixed at 0.20 for all Communications LOGIST runs. 
Therefore the mean is .20 with a SD of zerr in all cases. 
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Table 6,6 

Pearson Correlation Coefficients 
for Communications a Parameters 



N3000 N1000 N500 N250 



N3000 1.00 

N1000 .90 1.00 

N500 .76 .81 1.00 

N250 .73 .71 .62 1,00 



All coefficients statistically sig-^.if icant (p=.001) 

Table 6.7 

Pearson Correlation Coefficients 
for Communications b Parameters 



N3000 N1000 N500 N250 



N300I0 1.00 

UlidB .96 1.00 

N500 .85 .86 1.00 

N250 .88 .84 .77 1.00 



All coefficients are statistically significant (p=. 301) 

low, below .40. Thusy if one accepts the results from 
representative sampling of N=3000 as being the most sounds 
oversampling results in unsatisfactory results. It is 
possible and plausible that the results from oversampling 
are the most sounds however, and that the results from 
representative sampling are inferior. Which is most sound 
seems best answered by studies using simulated data with 
known item parameters like those found for these items. 
Such a procedure was not included in this research. 

For mathematics, decreasing sample size without 
oversampling seems quite satisfactory down to Nal000, and 
even to N«750. The correlation among a parameter estimates 
(.88) is reasonably high, and the estimation of b parameters 
is quite satisfactory with Ns750. However, the standard 
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deviation of the b parameters starts to increase, and the 
correlation between a parameters to decrease (•84) at N-500, 
ca'!sing one to consider samples of that size too small for 
estimating parameters upon which to do such things as equate 
item difficulties for a pool of items. Since that is one of 
Ine main anticipated uses of item parameters, these results 
suggest that for mathematics such parameters be estimated on 
no fewer than 750 cases representing the target population. 

For communications, not only were c parameters rarely 
individually estim^table, but oversampling did not help, and 
when sample sizes became smaller than N=1000, correlations 
of both a and b parameters with their counterparts at N=3000 
were too low to be encouraging. Thus, for communications 
one would recommend larger experi mental g roups than for 
mathematics for obtaining item parameters, with Ns of 1000 
or more being appropriate. Presumably this increased sample 
size requirement is a function of the fact that the 
communications items are so very easy. 

For both communications and mathematics, these data 
lead us to the conclusion that estimating c parameters is 
probably not worth the trouble with these minimum competence 
data. Most of the c parameters will be set a common value 
which will be near the value at which one would fix them at 
in advance. Thus, little is to be gained by estimation. 
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CONCLUSIONS 

This set of studies was directed at answering some 
basic questions about the feasibility of using a three- 
parameter IRT model for the Florida Statewide Assessment 
Tests in place of the one-parameter model currently in use. 
Several basic questions were apparent. First, would the 
available computer program, LOGIST 4, converge when applied 
to these data which are so highly skewed due to the very 
easy items appro pr i ate fo r minimum competency tests? The 
answer to that question is yes. 

Second, we wondered whether the data were 
unidimensional as is assumed clearly by item response theory 
and implicitly by conventional theory. Againr the answer is 
in the affirmative. We see no indication that the data are 
sufficiently multidimensionc^l to pose a problem. 

Third, the data should theoretically fit a model more 
closely if the model contained more parameters. Indeed, 
these data fit a model which includes the discrimination 
parameter appreciably better than one which ignores that 
parameter. Including a guessing parameter, however, 
provided little gain in fit, perhaps because the items are 
so easy that there is little guessing and because estimation 
of guessing parameters is hazardous on such easy items. 

Fourth, we tried several approaches to improve the 
estimation of the guessing parameters of the items— at least 
to result in individual guessing parameters being estimated 
for larger numbers of items. These approaches were 
generally unsuccessful. For mathematics, at best guessing 
parameters were estimated for only one third of the items. 
For communications, rarely were ^jjessing parameters 
estimated for individual items. Thus, estimation and use of 
different guessing parameters for individual items is not 
really feasible for these tests uf ing LOGIST 4. Using a 
common guessing parameter for all ite^ns, Dased on the number 
of alternatives contained in the itCai, ^s he sound way to 
approach this aspect of use of the tnree* parameter model. 
It appears wiser to do this than to resort to a two- 
parameter model in which it is assumed that an individual 
has no chance of answering a multiple-choice item like these 
by guessing. It may be that LOGIST 5 or BILOG would be more 
successful in estimating guessing parameters, but to our 
knowledge no one has tried those procedures on data similar 
to these. 

A fifth question is whether modest sized samples, 500 
or so, as are currently used in obtaining estimates of item 
parameters for the one-parameter model in the Statewide 
Assessment Tests, can be used satisfactorily for estimating 
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item parameters in a more complex model. The answar seems 
to be negative. One might use as few as 750 for the 
mathematics test, but for the communications test we could 
not recommend using fewer than 1000. 

Finally, it has been suggested that better estimates of 
the parameters might be obtained by oversampling low 
abilities. Our analyses were not encouraging. A serious 
problem is the extreme skew of the data. Unless one uses a 
huge number of examinees, there simply are not enough cases 
at the low end of the scale to be very helpful. 
Particularly for the communications test, oversampling at 
the low end did not result in more individual c parameters 
being estimated. A mere satisfying approach to this problem 
might be followed, that of using synthetic dat-i patterned 
after the data of the Statewide Assessment Tests but with 
known parameters. What procedure best estimates known 
parameters remains to be determined. 

In sum, a more complex model is feasible. It will fit 
the data better. A modify' V three-parameter model would be 
recommended at this poi. using the difficulty and 
dis rimination parameters but setting the guessing 
parameters to a common value based on the number of 
alternatives. The samples used to estimate the parameters 
should be of approximately 1000 cases. 
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